When I run:
python -m torch.distributed.launch --nproc_per_node=2 ./tools/train.py configs/waymo/voxelnet/waymo_centerformer.py
It shows the following error:
`2022-10-23 14:27:35,879 - INFO - Start running, work_dir: /dkliang/projects/synchronous/centerformer/work_dirs/waymo_centerformer
2022-10-23 14:27:35,880 - INFO - workflow: [('train', 1)], max: 20 epochs
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 44260) of binary: /dkliang/miniconda3/envs/centerformer/bin/python
Traceback (most recent call last):
  File "/dkliang/miniconda3/envs/centerformer/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/dkliang/miniconda3/envs/centerformer/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/dkliang/miniconda3/envs/centerformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/dkliang/miniconda3/envs/centerformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/dkliang/miniconda3/envs/centerformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/dkliang/miniconda3/envs/centerformer/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/dkliang/miniconda3/envs/centerformer/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/dkliang/miniconda3/envs/centerformer/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
This error message only shows that one of the subprocesses failed, so I can't tell what went wrong from it. Can you provide the environment you are running on, along with any other output logs or error messages?
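For reference, here is a minimal sketch of one way to gather that information using mostly the Python standard library. The helper names `describe_exitcode` and `collect_env` are hypothetical and are not part of this repository or of PyTorch. The sketch relies on the convention that a negative exit code -N means the process was killed by signal N, and it reports the PyTorch and CUDA versions only if `torch` can be imported.

```python
import faulthandler
import platform
import signal


def describe_exitcode(exitcode: int) -> str:
    """Translate a subprocess exit code into a readable description.

    A negative exit code -N means the process was killed by signal N.
    """
    if exitcode < 0:
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"exited with status {exitcode}"


def collect_env() -> dict:
    """Gather basic version information worth attaching to a bug report."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    try:
        import torch  # optional: only reported when PyTorch is importable

        info["torch"] = torch.__version__
        info["cuda"] = torch.version.cuda
    except ImportError:
        info["torch"] = "not installed"
    return info


if __name__ == "__main__":
    # Ask the interpreter to dump a Python-level traceback to stderr
    # if it receives a fatal signal such as SIGSEGV.
    faulthandler.enable()
    print(describe_exitcode(-11))
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Calling `faulthandler.enable()` near the top of the training script, or setting the environment variable `PYTHONFAULTHANDLER=1` before launching, may show which Python call was active when the crash happened. PyTorch also ships `python -m torch.utils.collect_env`, which prints a more complete environment report.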
When I run:
python -m torch.distributed.launch --nproc_per_node=2 ./tools/train.py configs/waymo/voxelnet/waymo_centerformer.py
It shows the following error:
==================================================
Root Cause:
[0]:
  time: 2022-10-23_14:27:44
  rank: 0 (local_rank: 0)
  exitcode: -11 (pid: 44260)
  error_file: <N/A>
  msg: "Signal 11 (SIGSEGV) received by PID 44260"
Other Failures:
[1]:
  time: 2022-10-23_14:27:44
  rank: 1 (local_rank: 1)
  exitcode: -11 (pid: 44261)
  error_file: <N/A>
  msg: "Signal 11 (SIGSEGV) received by PID 44261"`