Open Jason-wwww opened 1 year ago
The job you submitted has resource requirements, and you need to allocate more resources to FATE.
You can raise the compute_core=4 setting in parties.conf to give the FATE cluster more resources.
After modifying compute_core on the host, do I need to run docker-compose restart?
You need to clean up everything and re-deploy.
Thanks, it works after re-deploying and re-submitting the job.
While checking the status with flow task query -r guest -j 202111230933232084530 | grep -w f_status, I got a failure.
How can I get logs for this failure, and how do I solve it?
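As an alternative to grepping for f_status, the query output can be parsed to see which task failed. The following is a minimal sketch, assuming the output of flow task query is JSON with a "data" list of task records, that each record carries "f_status" and "f_component_name" fields, and that a failed task is reported with the status "failed". The sample response is hypothetical and trimmed for illustration.

```python
import json


def failed_tasks(query_output: str) -> list:
    """Return the task records whose f_status is 'failed'."""
    response = json.loads(query_output)
    # Assumption: task records are listed under the "data" key.
    tasks = response.get("data") or []
    return [task for task in tasks if task.get("f_status") == "failed"]


# Hypothetical, trimmed response used only for illustration.
sample = json.dumps({
    "retcode": 0,
    "retmsg": "success",
    "data": [
        {"f_component_name": "reader_0", "f_status": "success"},
        {"f_component_name": "hetero_lr_0", "f_status": "failed"},
    ],
})

for task in failed_tasks(sample):
    print(task.get("f_component_name"), task["f_status"])
```

Knowing the failing component narrows down which log files to read under the job's log directory.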
This means that one of the tasks of your job failed.
You need to run "docker exec -it <your fateflow container id> bash".
Then check the log directory in the fateflow directory.
There should be a directory named after the job ID. Pay attention to the error logs.
Hi, I can't find the log directory under the fateflow folder. There is only one examples folder under /data/projects/fate/fateflow.
This is the output of job submit:
{ "data": { "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=202210240645313003060&role=guest&party_id=9999", "code": 0, "dsl_path": "/data/projects/fate/fateflow/jobs/202210240645313003060/job_dsl.json", "job_id": "202210240645313003060", "logs_directory": "/data/projects/fate/fateflow/logs/202210240645313003060", "message": "success", "model_info": { "model_id": "arbiter-10000#guest-9999#host-10000#model", "model_version": "202210240645313003060" }, "pipeline_dsl_path": "/data/projects/fate/fateflow/jobs/202210240645313003060/pipeline_dsl.json", "runtime_conf_on_party_path": "/data/projects/fate/fateflow/jobs/202210240645313003060/guest/9999/job_runtime_on_party_conf.json", "runtime_conf_path": "/data/projects/fate/fateflow/jobs/202210240645313003060/job_runtime_conf.json", "train_runtime_conf_path": "/data/projects/fate/fateflow/jobs/202210240645313003060/train_runtime_conf.json" }, "jobId": "202210240645313003060", "retcode": 0, "retmsg": "success" }
I can see "logs_directory": "/data/projects/fate/fateflow/logs/202210240645313003060", but I can't find that directory.
docker exec -it "your fateflow container id" bash
The log is inside the container.
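Since logs_directory in the submit response is a path inside the container, it can be handy to build the lookup command straight from that response. The sketch below only constructs the command string and does not run it. The container name my-fateflow-container is a hypothetical placeholder, and the sample response is trimmed to the fields used here.

```python
import json
import shlex


def log_listing_command(submit_output: str, container: str) -> str:
    """Build a docker exec command that lists the job's log directory."""
    response = json.loads(submit_output)
    # logs_directory is reported under "data" in the job submit response.
    logs_dir = response["data"]["logs_directory"]
    return f"docker exec {shlex.quote(container)} ls {shlex.quote(logs_dir)}"


# Trimmed copy of the job submit response, keeping only the fields used here.
sample = json.dumps({
    "data": {
        "logs_directory": "/data/projects/fate/fateflow/logs/202210240645313003060"
    },
    "jobId": "202210240645313003060",
    "retcode": 0,
})

# "my-fateflow-container" is a hypothetical name; use your own container id.
print(log_listing_command(sample, "my-fateflow-container"))
```

Run the printed command on the Docker host, not inside the container.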
Yes, I found the log file in the container. But I still can't find the directory I mentioned above.
Is "/data/projects/fate/fateflow/logs/202210240645313003060" not in the fateflow container?
I am getting the same error as the one mentioned at the start, except that I deployed KubeFATE through Kubernetes, with the same KubeFATE version. Please help me out; I have been stuck for a long time.
@Mansi2487 Give spark-worker or nodemanager more resources:
# resources:
#   requests:
#     cpu: "2"
#     memory: "4Gi"
#   limits:
#     cpu: "4"
#     memory: "8Gi"
What deployment mode are you using?
What KubeFATE and FATE version are you using? 1.9.0
MUST: Please state the KubeFATE and FATE version in which you found the issue. 1.9.0
What OS are you using for docker-compose or Kubernetes? Please also state the OS version.
To Reproduce
Refer to https://github.com/FederatedAI/KubeFATE/blob/master/docker-deploy/README.md
flow job submit -d fateflow/examples/lr/test_hetero_lr_job_dsl.json -c fateflow/examples/lr/test_hetero_lr_job_conf.json
What happened? I got an error on job submit:
{ "jobId": "202210210745344412160", "retcode": 103, "retmsg": "Traceback (most recent call last):\n File \"/data/projects/fate/fateflow/python/fate_flow/scheduler/dag_scheduler.py\", line 142, in submit\n raise Exception(\"create job failed\", response)\nException: ('create job failed', {'guest': {9999: {'data': {'job_id': '202210210745344412160'}, 'retcode': 103, 'retmsg': 'max cores per job is 4 base on (fate_flow/settings#MAX_CORES_PERCENT_PER_JOB * conf/service_conf.yaml#nodes * conf/service_conf.yaml#cores_per_node), expect 8 cores, please use task_cores job parameters to set request task cores or you can customize it with eggroll_run job parameters, default value is fate_flow/settings.py#DEFAULT_TASK_CORES_PER_NODE, refer fate_flow/examples/simple/simple_job_conf.json'}}, 'host': {10000: {'data': {'job_id': '202210210745344412160'}, 'retcode': 103, 'retmsg': 'max cores per job is 4 base on (fate_flow/settings#MAX_CORES_PERCENT_PER_JOB * conf/service_conf.yaml#nodes * conf/service_conf.yaml#cores_per_node), expect 8 cores, please use task_cores job parameters to set request task cores or you can customize it with eggroll_run job parameters, default value is fate_flow/settings.py#DEFAULT_TASK_CORES_PER_NODE, refer fate_flow/examples/simple/simple_job_conf.json'}}, 'arbiter': {10000: {'data': {'components': {'data_transform_0': {'need_run': False}, 'evaluation_0': {'need_run': True}, 'hetero_feature_binning_0': {'need_run': False}, 'hetero_feature_selection_0': {'need_run': False}, 'hetero_lr_0': {'need_run': True}, 'intersection_0': {'need_run': False}, 'reader_0': {'need_run': False}}}, 'retcode': 0, 'retmsg': 'success'}}})\n" }
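The retmsg above states both the per-job core cap and the number of cores the job asked for. Below is a minimal sketch that extracts the two numbers from that message and reproduces the product the message describes. The helper names are hypothetical, and the values passed to max_cores_per_job (percent 1.0, 1 node, 4 cores per node) are assumptions chosen only because they yield the cap of 4 reported here; they are not read from any FATE configuration.

```python
import re


def parse_core_limits(retmsg: str):
    """Extract (allowed, requested) cores from the error message, or None."""
    max_match = re.search(r"max cores per job is (\d+)", retmsg)
    expect_match = re.search(r"expect (\d+) cores", retmsg)
    if not (max_match and expect_match):
        return None
    return int(max_match.group(1)), int(expect_match.group(1))


def max_cores_per_job(percent: float, nodes: int, cores_per_node: int) -> int:
    """Product named in the message: percent * nodes * cores_per_node."""
    return int(percent * nodes * cores_per_node)


# Excerpt of the retmsg from the failed submit.
retmsg = (
    "max cores per job is 4 base on (fate_flow/settings#MAX_CORES_PERCENT_PER_JOB"
    " * conf/service_conf.yaml#nodes * conf/service_conf.yaml#cores_per_node),"
    " expect 8 cores"
)

limits = parse_core_limits(retmsg)
if limits:
    allowed, requested = limits
    print(f"allowed={allowed} requested={requested} short_by={requested - allowed}")

# Assumed values, picked only to match the cap of 4 in the message.
print(max_cores_per_job(1.0, 1, 4))
```

The gap between the two numbers shows the two ways out that the message itself suggests: request fewer cores through the task_cores job parameter, or give the cluster more cores so the cap rises.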