OpenDriveLab / UniAD

[CVPR'23 Best Paper Award] Planning-oriented Autonomous Driving
Apache License 2.0

RuntimeError: replicas[0][0] in this process with sizes [200, 128] appears not to match sizes of the same param in process 0. #189

Open HamzaBenHaj opened 1 month ago

HamzaBenHaj commented 1 month ago

Hey everyone,

I am trying to get acquainted with UniAD and followed the instructions, but when I try to run the evaluation example:

./tools/uniad_dist_eval.sh ./projects/configs/stage1_track_map/base_track_map.py ./ckpts/uniad_base_track_map.pth 4

I receive the following error:

```
Traceback (most recent call last):
  File "./tools/test.py", line 261, in <module>
    main()
  File "./tools/test.py", line 227, in main
    model = MMDistributedDataParallel(
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: replicas[0][0] in this process with sizes [200, 128] appears not to match sizes of the same param in process 0.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801258 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801745 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801743 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801745 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801743 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 232910) of binary: /home/hammar/miniconda3/envs/uniad/bin/python
Traceback (most recent call last):
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
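For context, this RuntimeError comes from DDP's construction-time check (`_verify_model_across_ranks`) that every process holds identically shaped parameters, so it means the four processes are not building the same model. Below is a minimal, self-contained sketch of that failure mode; it is purely illustrative and not UniAD code, and the deliberately rank-dependent `num_rows` is the assumption that triggers the error.

```python
# Illustrative sketch only (not UniAD code): DDP checks parameter shapes
# across ranks when the wrapper is constructed; if the ranks build
# differently shaped parameters, construction fails with the same kind of
# size-mismatch RuntimeError seen above.
# Run with: torchrun --nproc_per_node=2 ddp_mismatch_demo.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Deliberately rank-dependent shape: rank 0 builds [200, 128],
    # every other rank builds [100, 128].
    num_rows = 200 if rank == 0 else 100
    model = torch.nn.Embedding(num_rows, 128).cuda()

    # The cross-rank parameter verification happens here and raises the error.
    DDP(model, device_ids=[rank])

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In other words, the message usually points at the processes ending up with different configs, code, or shape-determining state rather than at NCCL itself; the broadcast timeouts that follow look secondary, with the remaining ranks stuck in a collective after another rank has already crashed.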

I then tried to run the training example (I can only use 4 GPUs):

./tools/uniad_dist_train.sh ./projects/configs/stage1_track_map/base_track_map.py 4

but I get the same error:

```
Traceback (most recent call last):
  File "./tools/train.py", line 256, in <module>
Traceback (most recent call last):
  File "./tools/train.py", line 256, in <module>
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800494 milliseconds before timing out.
    main()
  File "./tools/train.py", line 245, in main
    custom_train_model(
  File "/home/hblab/UniAD/projects/mmdet3d_plugin/uniad/apis/train.py", line 21, in custom_train_model
    main()
  File "./tools/train.py", line 245, in main
    custom_train_detector(
  File "/home/hblab/UniAD/projects/mmdet3d_plugin/uniad/apis/mmdet_train.py", line 70, in custom_train_detector
    custom_train_model(
  File "/home/hblab/UniAD/projects/mmdet3d_plugin/uniad/apis/train.py", line 21, in custom_train_model
    model = MMDistributedDataParallel(
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    custom_train_detector(
  File "/home/hblab/UniAD/projects/mmdet3d_plugin/uniad/apis/mmdet_train.py", line 70, in custom_train_detector
    model = MMDistributedDataParallel(
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: replicas[0][0] in this process with sizes [200, 128] appears not to match sizes of the same param in process 0.
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: replicas[0][0] in this process with sizes [200, 128] appears not to match sizes of the same param in process 0.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800524 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800471 milliseconds before timing out.
```

```
Loading NuScenes tables for version v1.0-trainval...
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 346625) of binary: /home/hblab/miniconda3/envs/uniad/bin/python
Traceback (most recent call last):
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
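One way to narrow down which parameter differs is to dump each rank's parameter names and shapes right before the model is wrapped in MMDistributedDataParallel (in ./tools/test.py and projects/mmdet3d_plugin/uniad/apis/mmdet_train.py, per the tracebacks above) and diff the output across ranks. The helper below is a hypothetical debugging aid, not something that ships with UniAD; `dump_param_shapes` and where to call it are my assumptions.

```python
# Hypothetical debugging helper (not part of UniAD): call it on the model on
# every rank just before the MMDistributedDataParallel wrapper is created,
# then compare the printed shapes between rank 0 and the failing ranks.
import torch.distributed as dist


def dump_param_shapes(model, limit=None):
    # Print each named parameter's shape, tagged with the current rank.
    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
    for i, (name, param) in enumerate(model.named_parameters()):
        if limit is not None and i >= limit:
            break
        print(f"[rank {rank}] {name}: {tuple(param.shape)}", flush=True)
```

If I read base_track_map.py correctly, bev_h_ = 200 and the positional-encoding width is _dim_ // 2 = 128, so a [200, 128] parameter is plausibly the learned BEV positional embedding; if its shape differs between ranks, the processes are most likely not all building the model from the same config or code.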

Has anyone encountered this before?

Thanks!