FederatedAI / KubeFATE

Manage federated learning workload using cloud native technologies.
Apache License 2.0

docker-compose deployment docs: job submission error while verifying the Serving-Service feature #858

Open LemonGitMin opened 1 year ago

LemonGitMin commented 1 year ago

https://github.com/FederatedAI/KubeFATE/blob/master/docker-deploy/README_zh.md


```
root@2affb005c20b:/data/projects/fate# flow job submit -d fateflow/examples/lr/test_hetero_lr_job_dsl.json -c fateflow/examples/lr/test_hetero_lr_job_conf.json
{
    "jobId": "202302241041033508860",
    "retcode": 103,
    "retmsg": "Traceback (most recent call last):
  File \"/data/projects/fate/fateflow/python/fate_flow/scheduler/dag_scheduler.py\", line 142, in submit
    raise Exception(\"create job failed\", response)
Exception: ('create job failed', {
  'guest': {9999: {'data': {'job_id': '202302241041033508860'}, 'retcode': 103,
    'retmsg': 'max cores per job is 4 base on (fate_flow/settings#MAX_CORES_PERCENT_PER_JOB conf/service_conf.yaml#nodes conf/service_conf.yaml#cores_per_node), expect 8 cores, please use task_cores job parameters to set request task cores or you can customize it with spark_run job parameters, default value is fate_flow/settings.py#DEFAULT_TASK_CORES_PER_NODE, refer fate_flow/examples/simple/simple_job_conf.json'}},
  'host': {10000: {'data': {'job_id': '202302241041033508860'}, 'retcode': 103,
    'retmsg': 'max cores per job is 4 base on (fate_flow/settings#MAX_CORES_PERCENT_PER_JOB conf/service_conf.yaml#nodes conf/service_conf.yaml#cores_per_node), expect 8 cores, please use task_cores job parameters to set request task cores or you can customize it with spark_run job parameters, default value is fate_flow/settings.py#DEFAULT_TASK_CORES_PER_NODE, refer fate_flow/examples/simple/simple_job_conf.json'}},
  'arbiter': {10000: {'data': {'components': {'data_transform_0': {'need_run': False}, 'evaluation_0': {'need_run': True}, 'hetero_feature_binning_0': {'need_run': False}, 'hetero_feature_selection_0': {'need_run': False}, 'hetero_lr_0': {'need_run': True}, 'intersection_0': {'need_run': False}, 'reader_0': {'need_run': False}}},
    'retcode': 0, 'retmsg': 'success'}}})\n"
```
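The limit in the error is the product named in the message itself: MAX_CORES_PERCENT_PER_JOB × nodes × cores_per_node. A minimal sketch of that check, with values assumed for a default single-node docker-compose deployment (the real values live in conf/service_conf.yaml and fate_flow/settings.py):

```python
# Sketch of the fate_flow per-job core limit implied by the error above.
# The concrete numbers are assumptions for a single-node deployment.
MAX_CORES_PERCENT_PER_JOB = 1   # fraction of cluster cores one job may use
nodes = 1                        # conf/service_conf.yaml#nodes
cores_per_node = 4               # conf/service_conf.yaml#cores_per_node

max_cores_per_job = int(MAX_CORES_PERCENT_PER_JOB * nodes * cores_per_node)
task_cores_requested = 8         # what the example hetero-LR job asked for

print(f"limit={max_cores_per_job}, requested={task_cores_requested}, "
      f"rejected={task_cores_requested > max_cores_per_job}")
# → limit=4, requested=8, rejected=True  (matches "max cores per job is 4 ... expect 8 cores")
```

So either the cluster's core allocation must grow, or the job must request fewer cores via the `task_cores` job parameter mentioned in the error.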

owlet42 commented 1 year ago

Please allocate enough resources to the cluster: you can modify compute_core=4 in parties.conf, changing 4 to 8 or 16.
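For reference, the relevant line in docker-deploy/parties.conf looks roughly like this (the value 8 is illustrative; pick one that fits your hosts' actual core counts):

```shell
# docker-deploy/parties.conf (excerpt, illustrative)
# Cores allocated to each party's FATE containers; the example
# hetero-LR job above requests 8, so the default of 4 is rejected.
compute_core=8
```

After changing it, the configuration has to be regenerated and the cluster redeployed (in KubeFATE's docker-deploy this is done with generate_config.sh and docker_deploy.sh) for the new limit to take effect.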

LemonGitMin commented 1 year ago

Thanks, that solved it. But after the job is submitted, when I query the job status every component is stuck in "waiting" and no logs are produced, and I can't tell what's causing it. I checked again after several hours and all components were still in the waiting state.

```
root@8d560c1a60d9:/data/projects/fate# flow job submit -d fateflow/examples/lr/test_hetero_lr_job_dsl.json -c fateflow/examples/lr/test_hetero_lr_job_conf.json
{
    "data": {
        "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=202303010956352140720&role=guest&party_id=9999",
        "code": 0,
        "dsl_path": "/data/projects/fate/fateflow/jobs/202303010956352140720/job_dsl.json",
        "job_id": "202303010956352140720",
        "logs_directory": "/data/projects/fate/fateflow/logs/202303010956352140720",
        "message": "success",
        "model_info": {
            "model_id": "arbiter-10000#guest-9999#host-10000#model",
            "model_version": "202303010956352140720"
        },
        "pipeline_dsl_path": "/data/projects/fate/fateflow/jobs/202303010956352140720/pipeline_dsl.json",
        "runtime_conf_on_party_path": "/data/projects/fate/fateflow/jobs/202303010956352140720/guest/9999/job_runtime_on_party_conf.json",
        "runtime_conf_path": "/data/projects/fate/fateflow/jobs/202303010956352140720/job_runtime_conf.json",
        "train_runtime_conf_path": "/data/projects/fate/fateflow/jobs/202303010956352140720/train_runtime_conf.json"
    },
    "jobId": "202303010956352140720",
    "retcode": 0,
    "retmsg": "success"
}
root@8d560c1a60d9:/data/projects/fate# flow task query -r guest -j 202303010956352140720 | grep -w f_status
    "f_status": "waiting",
    "f_status": "waiting",
    "f_status": "waiting",
    "f_status": "waiting",
    "f_status": "waiting",
    "f_status": "waiting",
    "f_status": "waiting",
```

Error message: it says the resource request to the coordinator (arbiter) failed. Is there a way to skip this, or to reduce the amount of resources requested?

```
[ERROR] [2023-03-01 14:18:23,937] [202303011418228241090] [8:140543061313280] - [federated_scheduler.federated_command] [line:295]: failed to sending /party/202303011418228241090/arbiter/10000/resource/apply federated command on arbiter 10000 detail: 2 apply for job 202303011418228241090 resource failed
```

owlet42 commented 1 year ago

Other jobs may be holding the resources; you can stop those jobs to release them. For job resource configuration, refer to this document: https://github.com/FederatedAI/FATE/blob/master/doc/tutorial/dsl_conf/dsl_conf_v2_setting_guide.zh.md#4-%E7%B3%BB%E7%BB%9F%E8%BF%90%E8%A1%8C%E5%8F%82%E6%95%B0
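A sketch of how this can be checked with the FATE Flow CLI already used above (the status-filter flag and the job id are assumptions for illustration; check `flow job query --help` on your deployment for the exact options):

```shell
# List jobs that may still be holding resources
# (-s/--status filter assumed to accept "running")
flow job query -s running

# Stop a stuck job to release its cores (placeholder job id)
flow job stop -j 202303010956352140720
```

Once the competing job is stopped, the waiting job should be able to acquire its resources from the arbiter and start.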

summerghw commented 1 year ago

Is this error caused by another program occupying the socket?

```
Traceback (most recent call last):
  File "xxxeggroll/python/eggroll/core/client.py", line 86, in sync_send
    response = _command_stub.call(request.to_proto())
  File "/data/projects/python/venv/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/data/projects/python/venv/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1689756189.935578730","description":"Error received from peer ipv4:xxx.xxx.xxx.xxx:41100","file":"src/core/lib/surface/call.cc","file_line":952,"grpc_message":"Socket closed","grpc_status":14}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "xxxfateflow/python/fate_flow/worker/task_executor.py", line 210, in _run_
    cpn_output = run_object.run(cpn_input)
  File "xxx/federatedml/model_base.py", line 239, in run
    self._run(cpn_input=cpn_input)
  File "xxx/federatedml/model_base.py", line 315, in _run
    this_data_output = func(*real_param)
  File "xxx/federatedml/statistic/intersect/intersect_model.py", line 269, in fit
    self.intersect_ids = self.intersection_obj.run_intersect(intersect_data)
  File "xxx/federatedml/statistic/intersect/ecdh_intersect/ecdh_intersect_base.py", line 123, in run_intersect
    id_intersect_cipher_cipher = self.get_intersect_doubly_encrypted_id(data_instances)
  File "xxx/federatedml/statistic/intersect/ecdh_intersect/ecdh_intersect_guest.py", line 69, in get_intersect_doubly_encrypted_id
    self.id_local_first = self._encrypt_id(data_instances,
  File "xxx/federatedml/statistic/intersect/ecdh_intersect/ecdh_intersect_base.py", line 79, in _encrypt_id
    return curve_instance.map_hash_encrypt(data_instances, mode=mode, hash_operator=hash_operator, salt=salt)
  File "xxx/federatedml/secureprotol/elliptic_curve_encryption.py", line 91, in map_hash_encrypt
    return plaintable.map(
  File "xxx/fate_arch/common/profile.py", line 318, in _fn
    rtn = func(*args, **kwargs)
  File "xxx/fate_arch/computing/eggroll/_table.py", line 87, in map
    return Table(self._rp.map(func))
  File "xxxeggroll/python/eggroll/core/aspects.py", line 30, in wrapper
    result = func(*args, **kwargs)
  File "xxxeggroll/python/eggroll/roll_pair/roll_pair.py", line 818, in map
    task_results = self._run_job(job=job)
  File "xxxeggroll/python/eggroll/roll_pair/roll_pair.py", line 475, in _run_job
    results.append(future.result())
  File "/opt/python3/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/opt/python3/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "xxxeggroll/python/eggroll/core/datastructure/threadpool.py", line 51, in run
    result = self.fn(*self.args, **self.kwargs)
  File "xxxeggroll/python/eggroll/core/client.py", line 99, in sync_send
    raise CommandCallError(command_uri, endpoint, e)
eggroll.core.client.CommandCallError: ('Failed to call command: CommandURI(_uri=v1/egg-pair/runTask) to endpoint: xxx.xxx.xxx.xxx:41100, caused by: ', <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1689756189.935578730","description":"Error received from peer ipv4:xxx.xxx.xxx.xxx:41100","file":"src/core/lib/surface/call.cc","file_line":952,"grpc_message":"Socket closed","grpc_status":14}"
>)
```
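For what it's worth, gRPC status UNAVAILABLE with "Socket closed" typically means the peer egg-pair worker process terminated mid-call (commonly an out-of-memory kill during the intersection step) rather than a port conflict; a port conflict would usually surface at bind time, not mid-RPC. Some hedged diagnostics, assuming shell access inside the peer's eggroll container (port number taken from the log above):

```shell
# Is anything still listening on the egg-pair port the RPC was talking to?
ss -tlnp | grep 41100

# Was an eggroll worker OOM-killed? (needs access to kernel logs)
dmesg | grep -i -E 'killed process|out of memory'
```

If the worker was OOM-killed, increasing the container memory limit or reducing job parallelism is the usual remedy.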

owlet42 commented 1 year ago

@summerghw See if there is any error on the other party's side. Also, what kind of tasks are you running in this environment?

summerghw commented 1 year ago

@owlet42 It's the same feature the OP was testing: flow job submit -d fateflow/examples/lr/test_hetero_lr_job_dsl.json -c fateflow/examples/lr/test_hetero_lr_job_conf.json. When I test with the example pipeline script on port 2000 the result is normal, and the toy test also passes. Could the conf file or the dsl file used by this test script be the problem?

owlet42 commented 1 year ago

@summerghw The conf and dsl should be fine. Check whether any of the other components are reporting errors.