FederatedAI / KubeFATE

Manage federated learning workload using cloud native technologies.
Apache License 2.0
Failure of job submit #784

Open Jason-wwww opened 1 year ago

Jason-wwww commented 1 year ago

What deployment mode are you using?

  1. docker-compose;

What KubeFATE and FATE version are you using?

1.9.0

What OS are you using for docker-compose or Kubernetes? Please also state the OS version.

To Reproduce

Follow https://github.com/FederatedAI/KubeFATE/blob/master/docker-deploy/README.md, then run:

  flow job submit -d fateflow/examples/lr/test_hetero_lr_job_dsl.json -c fateflow/examples/lr/test_hetero_lr_job_conf.json

What happened?

The job submission returns an error:

{ "jobId": "202210210745344412160", "retcode": 103, "retmsg": "Traceback (most recent call last):\n File \"/data/projects/fate/fateflow/python/fate_flow/scheduler/dag_scheduler.py\", line 142, in submit\n raise Exception(\"create job failed\", response)\nException: ('create job failed', {'guest': {9999: {'data': {'job_id': '202210210745344412160'}, 'retcode': 103, 'retmsg': 'max cores per job is 4 base on (fate_flow/settings#MAX_CORES_PERCENT_PER_JOB * conf/service_conf.yaml#nodes * conf/service_conf.yaml#cores_per_node), expect 8 cores, please use task_cores job parameters to set request task cores or you can customize it with eggroll_run job parameters, default value is fate_flow/settings.py#DEFAULT_TASK_CORES_PER_NODE, refer fate_flow/examples/simple/simple_job_conf.json'}}, 'host': {10000: {'data': {'job_id': '202210210745344412160'}, 'retcode': 103, 'retmsg': 'max cores per job is 4 base on (fate_flow/settings#MAX_CORES_PERCENT_PER_JOB * conf/service_conf.yaml#nodes * conf/service_conf.yaml#cores_per_node), expect 8 cores, please use task_cores job parameters to set request task cores or you can customize it with eggroll_run job parameters, default value is fate_flow/settings.py#DEFAULT_TASK_CORES_PER_NODE, refer fate_flow/examples/simple/simple_job_conf.json'}}, 'arbiter': {10000: {'data': {'components': {'data_transform_0': {'need_run': False}, 'evaluation_0': {'need_run': True}, 'hetero_feature_binning_0': {'need_run': False}, 'hetero_feature_selection_0': {'need_run': False}, 'hetero_lr_0': {'need_run': True}, 'intersection_0': {'need_run': False}, 'reader_0': {'need_run': False}}}, 'retcode': 0, 'retmsg': 'success'}}})\n" }

owlet42 commented 1 year ago

The job requests more cores than FATE currently allows: the error shows it expects 8 cores while the per-job maximum is 4. You need to allocate more resources to FATE. To do so, increase the compute_core value (currently compute_core=4) in parties.conf.
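The limit quoted in the error message can be reproduced with a little arithmetic. The sketch below is only an illustration and not FATE Flow's actual code: the function names are hypothetical, and the inputs (a percentage of 1.0, one node, four cores per node) are assumptions chosen because they yield the reported limit of 4.

```python
def max_cores_per_job(max_cores_percent_per_job: float, nodes: int, cores_per_node: int) -> int:
    """Upper bound named in the error message:
    MAX_CORES_PERCENT_PER_JOB * nodes * cores_per_node."""
    return int(max_cores_percent_per_job * nodes * cores_per_node)


def request_fits(requested_cores: int, limit: int) -> bool:
    """True when the job's core request does not exceed the per-job limit."""
    return requested_cores <= limit


# Assumed values that reproduce the situation in the error message.
limit = max_cores_per_job(1.0, nodes=1, cores_per_node=4)
print(f"limit={limit}, 8-core request fits: {request_fits(8, limit)}")
```

Under these assumptions, either the limit has to rise (more cores per node or more nodes) or the job has to request fewer cores, which the error message says can be done through the task_cores job parameter.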

Jason-wwww commented 1 year ago

After modifying compute_core on the host, do I need to restart docker-compose?

JingChen23 commented 1 year ago

You need to clean up everything and re-deploy.

Jason-wwww commented 1 year ago

Thanks, it works after re-deploying and re-submitting the job. However, when checking the status with flow task query -r guest -j 202111230933232084530 | grep -w f_status, I got a failure: [image]

How can I get logs for this failure, and how can I solve it?

JingChen23 commented 1 year ago

This means that one of the tasks of your job failed.

You need to run "docker exec -it <your fateflow container id> bash".

Then check the log directory in the fateflow directory.

There should be a directory named by the job id. Pay attention to the error logs.
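Once inside the container, scanning the job's log directory for error lines can be scripted. The following is a minimal sketch under stated assumptions: the helper name find_error_lines is hypothetical, the markers "ERROR" and "Traceback" are a guess at what a failing task writes, and the directory layout below the job id is not assumed at all (the whole tree is walked).

```python
import sys
from pathlib import Path


def find_error_lines(job_log_dir, markers=("ERROR", "Traceback")):
    """Return (file, line number, text) for every log line containing a marker."""
    hits = []
    for path in sorted(Path(job_log_dir).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log holds odd bytes.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(marker in line for marker in markers):
                    hits.append((str(path), number, line.rstrip("\n")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the job's log directory as the first argument.
    for path, number, line in find_error_lines(sys.argv[1]):
        print(f"{path}:{number}: {line}")
```

The script takes the job's log directory as its only argument and prints each matching line with its file and line number.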

Jason-wwww commented 1 year ago

Hi, I can't find the log directory under the fateflow folder. There is only one folder, examples, under /data/projects/fate/fateflow.

Jason-wwww commented 1 year ago

This is the output of job submit: { "data": { "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=202210240645313003060&role=guest&party_id=9999", "code": 0, "dsl_path": "/data/projects/fate/fateflow/jobs/202210240645313003060/job_dsl.json", "job_id": "202210240645313003060", "logs_directory": "/data/projects/fate/fateflow/logs/202210240645313003060", "message": "success", "model_info": { "model_id": "arbiter-10000#guest-9999#host-10000#model", "model_version": "202210240645313003060" }, "pipeline_dsl_path": "/data/projects/fate/fateflow/jobs/202210240645313003060/pipeline_dsl.json", "runtime_conf_on_party_path": "/data/projects/fate/fateflow/jobs/202210240645313003060/guest/9999/job_runtime_on_party_conf.json", "runtime_conf_path": "/data/projects/fate/fateflow/jobs/202210240645313003060/job_runtime_conf.json", "train_runtime_conf_path": "/data/projects/fate/fateflow/jobs/202210240645313003060/train_runtime_conf.json" }, "jobId": "202210240645313003060", "retcode": 0, "retmsg": "success" }

I can see the "logs_directory": "/data/projects/fate/fateflow/logs/202210240645313003060", however I can't find the directory.
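The paths in the submit output can be pulled out programmatically instead of being read by eye. This is a small sketch, assuming the output has the same shape as the JSON above; the function name job_locations is hypothetical.

```python
import json


def job_locations(submit_output: str) -> dict:
    """Extract the job id, log directory and board URL from `flow job submit` output."""
    response = json.loads(submit_output)
    if response.get("retcode") != 0:
        raise RuntimeError(f"job submit failed: {response.get('retmsg')}")
    data = response.get("data", {})
    return {
        "job_id": response.get("jobId") or data.get("job_id"),
        "logs_directory": data.get("logs_directory"),
        "board_url": data.get("board_url"),
    }


# Abbreviated sample taken from the output above.
sample = (
    '{"data": {"job_id": "202210240645313003060", '
    '"logs_directory": "/data/projects/fate/fateflow/logs/202210240645313003060"}, '
    '"jobId": "202210240645313003060", "retcode": 0, "retmsg": "success"}'
)
print(job_locations(sample)["logs_directory"])
```

The returned logs_directory is just the string reported by the submit command; whether that path is visible depends on where the command is run.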

JingChen23 commented 1 year ago

docker exec -it "your fateflow container id" bash

The log is inside the container.

Jason-wwww commented 1 year ago

Yes, I found the log file in the container, but I can't find it as I mentioned above.

JingChen23 commented 1 year ago

"/data/projects/fate/fateflow/logs/202210240645313003060" Is this directory not in the fateflow container?

Mansi2487 commented 1 year ago

I am getting the same error as the one mentioned at the start, except that I deployed KubeFATE through Kubernetes. It is the same version of KubeFATE. Please help me out; I have been stuck for a long time.

owlet42 commented 1 year ago

@Mansi2487 Give spark-worker or nodemanager more resources:

  # resources:
    # requests:
      # cpu: "2"
      # memory: "4Gi"
    # limits:
      # cpu: "4"
      # memory: "8Gi"