facebookresearch / maskrcnn-benchmark

Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch.
MIT License
9.31k stars 2.49k forks

training job failed #1329

Closed by manibharathy1 2 years ago

manibharathy1 commented 2 years ago

Failure reason: I ran the training job on a single node with instance type g4dn.12xlarge, but the job failed. Please help me solve this issue.

AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 -m launch_ddp --config configs/dist-training-config.yaml"
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/ml/code/launch_ddp.py", line 42, in <module>
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=joint_cmd)
subprocess.CalledProcessError: Command 'python -m torch.distributed.launch --nnodes 1 --node_rank 0 --nproc_per_node 4 --master_addr algo-1 --master_port 55555 /opt/ml/code/train.py --config configs/dist-training-config.yaml' returned non-zero exit status 1.
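The traceback above only reports that the launched command returned exit status 1. The actual error from train.py is in the worker output, which is not shown here. As a rough sketch, a launcher wrapper could echo the child's output and include its last lines in the exception it raises. The contents of launch_ddp.py are not shown in this thread, so the helper name `run_and_surface_errors` and its `tail_lines` parameter are hypothetical. This is not the actual launcher code.

```python
import subprocess
import sys


def run_and_surface_errors(cmd, tail_lines=50):
    """Run cmd (a list of arguments), echo its output, and on failure
    raise an error that includes the last lines the child printed.

    Hypothetical helper for illustration; not taken from launch_ddp.py.
    """
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so worker tracebacks are captured
        universal_newlines=True,   # text mode; available on Python 3.6
    )
    captured = []
    for line in process.stdout:
        sys.stdout.write(line)     # keep streaming output to the job log
        captured.append(line)
    process.wait()
    if process.returncode != 0:
        tail = "".join(captured[-tail_lines:])
        raise RuntimeError(
            "Command {!r} exited with status {}. Last output:\n{}".format(
                " ".join(cmd), process.returncode, tail
            )
        )
    return captured
```

Passing the `python -m torch.distributed.launch ...` command from the traceback as a list of arguments should then put the underlying train.py error into the failure message, assuming the workers write it to stdout or stderr.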

Zizzzzzzz commented 2 years ago

Have you solved it?