TuSimple / centerformer

Implementation for CenterFormer: Center-based Transformer for 3D Object Detection (ECCV 2022)
MIT License

torch.distributed.elastic.multiprocessing.errors.ChildFailedError #4

Open dk-liang opened 1 year ago

dk-liang commented 1 year ago

When I run:

python -m torch.distributed.launch --nproc_per_node=2 ./tools/train.py configs/waymo/voxelnet/waymo_centerformer.py

it fails with the following error:

    2022-10-23 14:27:35,879 - INFO - Start running, work_dir: /dkliang/projects/synchronous/centerformer/work_dirs/waymo_centerformer
    2022-10-23 14:27:35,880 - INFO - workflow: [('train', 1)], max: 20 epochs
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 44260) of binary: /dkliang/miniconda3/envs/centerformer/bin/python
    Traceback (most recent call last):
      File "/dkliang/miniconda3/envs/centerformer/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/dkliang/miniconda3/envs/centerformer/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/dkliang/miniconda3/envs/centerformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
        main()
      File "/dkliang/miniconda3/envs/centerformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
        launch(args)
      File "/dkliang/miniconda3/envs/centerformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
        run(args)
      File "/dkliang/miniconda3/envs/centerformer/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
        elastic_launch(
      File "/dkliang/miniconda3/envs/centerformer/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/dkliang/miniconda3/envs/centerformer/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
             ./tools/train.py FAILED
    ==================================================
    Root Cause:
    [0]:
      time: 2022-10-23_14:27:44
      rank: 0 (local_rank: 0)
      exitcode: -11 (pid: 44260)
      error_file: <N/A>
      msg: "Signal 11 (SIGSEGV) received by PID 44260"
    Other Failures:
    [1]:
      time: 2022-10-23_14:27:44
      rank: 1 (local_rank: 1)
      exitcode: -11 (pid: 44261)
      error_file: <N/A>
      msg: "Signal 11 (SIGSEGV) received by PID 44261"
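
For reference, the `exitcode: -11` in the log follows the usual POSIX convention that Python's process APIs also use: a negative value `-N` means the worker was terminated by signal `N`, which matches the `Signal 11 (SIGSEGV)` message. A minimal sketch of decoding such a code with the standard library (the helper name `describe_exitcode` is made up for illustration and is not part of PyTorch):

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Return a readable description of a child process exit code."""
    if exitcode < 0:
        # A negative value means the process was killed by signal number -exitcode.
        try:
            return f"killed by {signal.Signals(-exitcode).name}"
        except ValueError:
            # The number does not map to a signal known on this platform.
            return f"killed by unknown signal {-exitcode}"
    return f"exited with status {exitcode}"


if __name__ == "__main__":
    print(describe_exitcode(-11))
```

A segmentation fault happens in native code, which is why the launcher's Python traceback above only shows the launcher itself and not the line in `train.py` that crashed.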

edwardzhou130 commented 1 year ago

This error message only shows that the training subprocesses crashed; it does not say why, so I can't tell what's wrong from it. Can you share the environment you are running on (OS, Python, PyTorch and CUDA versions) and any other output logs or error messages?
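
One possible way to gather that information is sketched below, assuming only the Python standard library plus whatever PyTorch is installed. The function names `collect_environment` and `enable_crash_traceback` are hypothetical helpers, not part of this repository. `faulthandler` is a standard-library module that prints the Python stack to stderr when the process receives SIGSEGV, which may help locate the call that triggers the crash.

```python
import faulthandler
import platform
import sys


def collect_environment() -> dict:
    """Gather basic version information to paste into a bug report."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    try:
        import torch

        info["torch"] = torch.__version__
        # CUDA version PyTorch was built against (None for CPU-only builds).
        info["cuda_build"] = torch.version.cuda
        info["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        info["torch"] = "not installed"
    return info


def enable_crash_traceback() -> bool:
    """Ask Python to dump tracebacks of all threads on fatal signals such as SIGSEGV."""
    try:
        faulthandler.enable(all_threads=True)
    except (RuntimeError, AttributeError, ValueError, OSError):
        # stderr may be missing or have no file descriptor in some environments.
        return False
    return faulthandler.is_enabled()


if __name__ == "__main__":
    enable_crash_traceback()
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

The same traceback dump can be turned on without editing any code by setting the environment variable `PYTHONFAULTHANDLER=1` before the launch command, and `python -m torch.utils.collect_env` prints a more detailed environment report.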