FederatedAI / FATE

An Industrial Grade Federated Learning Framework
Apache License 2.0
5.73k stars 1.55k forks

homo-xgb fails with ‘pika.exceptions.StreamLostError: Stream connection lost: ConnectionResetError(104, 'Connection reset by peer')’ #4751

Closed MDGBDGMG closed 4 months ago

MDGBDGMG commented 1 year ago

Describe the bug: When two parties run homo-xgb on datasets of 400,000 rows each, the arbiter fails with ‘pika.exceptions.StreamLostError: Stream connection lost: ConnectionResetError(104, 'Connection reset by peer')’
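For readers unfamiliar with the inner exception: the `ConnectionResetError(104, ...)` wrapped by `StreamLostError` is Python's built-in exception for the OS error `ECONNRESET`, meaning the remote end of the TCP connection closed it (presumably the RabbitMQ broker here, though the message alone does not prove that). The sketch below uses only the standard library; the helper name `describe_os_error` is hypothetical and is not part of pika or FATE.

```python
import errno
import os


def describe_os_error(exc: OSError) -> str:
    """Return a short 'NAME (code): message' description of an OSError."""
    if exc.errno is None:
        # No errno attached; fall back to the plain string form.
        return str(exc)
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(exc.errno, "UNKNOWN")
    return f"{name} ({exc.errno}): {os.strerror(exc.errno)}"


# OSError selects the matching subclass from the errno value, so building it
# with ECONNRESET yields a ConnectionResetError, the type seen in the error.
exc = OSError(errno.ECONNRESET, os.strerror(errno.ECONNRESET))
print(type(exc).__name__)
print(describe_os_error(exc))
```

The numeric value of `ECONNRESET` is platform-dependent; the 104 in the error message is the Linux value.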

To Reproduce Steps to reproduce the behavior:

  1. conf file
     {
       "job_parameters": {
         "common": {
           "job_type": "train",
           "model_version": "202303311649128678570",
           "pulsar_run": {},
           "auto_retries": 0,
           "computing_engine": "SPARK",
           "model_id": "arbiter-1640046801#guest-1639995892#host-1639998474#model",
           "task_parallelism": 1,
           "rabbitmq_run": {},
           "engines_address": {},
           "computing_partitions": 48,
           "federated_status_collect_type": "PUSH",
           "inheritance_info": {},
           "auto_retry_delay": 1,
           "adaptation_parameters": {
             "request_task_cores": 16,
             "task_cores_per_node": 8,
             "task_nodes": 6,
             "if_initiator_baseline": true,
             "task_memory_per_node": 0
           },
           "eggroll_run": {},
           "spark_run": {
             "driver-memory": "48G",
             "num-executors": 6,
             "executor-cores": 8,
             "executor-memory": "36G"
           },
           "federated_mode": "MULTIPLE"
         }
       },
       "component_parameters": {
         "role": {
           "host": {
             "0": {
               "reader_0": { "table": { "name": "train-1-450", "namespace": "train-1-450" } },
               "data_transform_0": { "output_format": "dense", "with_label": true }
             }
           },
           "guest": {
             "0": {
               "reader_0": { "table": { "name": "train-1-450", "namespace": "train-1-450" } },
               "data_transform_0": { "output_format": "dense", "with_label": true }
             }
           }
         },
         "common": {
           "evaluation_0": { "eval_type": "binary" },
           "homo_secureboost_0": {
             "num_trees": 3,
             "validation_freqs": 1,
             "objective_param": { "objective": "cross_entropy" },
             "task_type": "classification",
             "tree_param": { "max_depth": 3 }
           }
         }
       },
       "dsl_version": 2,
       "role": { "arbiter": [ 1640046801 ], "host": [ 1639998474 ], "guest": [ 1639995892 ] },
       "conf_path": "homo-xgb.conf",
       "initiator": { "role": "guest", "party_id": 1639995892 },
       "dsl_path": "homo-xgb.dsl"
     }

  2. dsl file
     {
       "components": {
         "reader_0": {
           "output": { "data": [ "data" ] },
           "module": "Reader"
         },
         "data_transform_0": {
           "output": { "data": [ "data" ], "model": [ "model" ] },
           "input": { "data": { "data": [ "reader_0.data" ] } },
           "module": "DataTransform"
         },
         "evaluation_0": {
           "output": { "data": [ "data" ] },
           "input": { "data": { "data": [ "homo_secureboost_0.data" ] } },
           "module": "Evaluation"
         },
         "homo_secureboost_0": {
           "output": { "data": [ "data" ], "model": [ "model" ] },
           "input": { "data": { "train_data": [ "data_transform_0.data" ] } },
           "module": "HomoSecureBoost"
         }
       }
     }

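  3. Environment: FATE 1.10 on Spark, using RabbitMQ for communication. partyA and partyB each hold a dataset of 400,000 rows and 450 columns; partyC acts as the arbiter, and the three parties jointly run homo-xgb.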
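As a quick sanity check on the numbers above (not a diagnosis of the error), the following sketch recomputes a few figures from the conf and the environment description. The field values are copied from the conf; the helper names and the 8-byte-per-value assumption for the dense data are illustrative only.

```python
# Values copied from the conf in step 1 (only the fields used below).
conf = {
    "job_parameters": {"common": {
        "computing_partitions": 48,
        "spark_run": {"num-executors": 6, "executor-cores": 8},
    }},
    "role": {"arbiter": [1640046801], "host": [1639998474], "guest": [1639995892]},
    "initiator": {"role": "guest", "party_id": 1639995892},
}


def spark_total_cores(conf: dict) -> int:
    """Total executor cores requested via spark_run (hypothetical helper)."""
    spark = conf["job_parameters"]["common"]["spark_run"]
    return spark["num-executors"] * spark["executor-cores"]


def initiator_is_declared(conf: dict) -> bool:
    """Check that the initiator's party_id is listed under its role."""
    init = conf["initiator"]
    return init["party_id"] in conf["role"].get(init["role"], [])


common = conf["job_parameters"]["common"]
total_cores = spark_total_cores(conf)
print("total executor cores:", total_cores)
print("partitions per core:", common["computing_partitions"] / total_cores)
print("initiator declared in role:", initiator_is_declared(conf))

# Rough dense size per party: 400,000 rows x 450 columns,
# assuming 8 bytes per value (the actual storage format is not stated).
rows, cols, bytes_per_value = 400_000, 450, 8
approx_bytes = rows * cols * bytes_per_value
print("approx. dense size per party (GB):", approx_bytes / 1e9)
```

With these values, the Spark settings request 6 × 8 = 48 executor cores, which matches `computing_partitions`, and each party's dataset is on the order of 1.44 GB under the stated assumption.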

Other jobs run without problems, for example PSI and hetero-lr on the 400,000-row datasets, but this homo-xgb job always fails with this error.
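Since only the homo-xgb job fails, it may help to see where `homo_secureboost_0` sits in the pipeline. The sketch below derives a dependency order from the `input.data` references in the DSL of step 2 using Python's standard `graphlib`. It only reads the dependencies; it is not a description of how FATE schedules components internally, and `execution_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The `input.data` mapping of each component, copied from the DSL in step 2.
dsl_inputs = {
    "reader_0": {},
    "data_transform_0": {"data": ["reader_0.data"]},
    "homo_secureboost_0": {"train_data": ["data_transform_0.data"]},
    "evaluation_0": {"data": ["homo_secureboost_0.data"]},
}


def execution_order(inputs: dict) -> list:
    """Order components so that each one comes after its upstream components."""
    graph = {}
    for component, data_inputs in inputs.items():
        upstream = set()
        for refs in data_inputs.values():
            for ref in refs:
                # "reader_0.data" -> upstream component "reader_0"
                upstream.add(ref.rsplit(".", 1)[0])
        graph[component] = upstream
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dsl_inputs)
print(" -> ".join(order))
```

The DSL is a single chain, so the order is reader_0, data_transform_0, homo_secureboost_0, evaluation_0: the failing component is the first one in this job that involves the arbiter's aggregation role.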

  4. Error log: the error is reported on the arbiter side.

image

image

I hope someone can help me resolve this. Many thanks.

yinhang-e5b0b9e888aa commented 1 year ago

Same question here.