BrotherHappy commented 2 years ago

RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

filaPro commented 2 years ago

Can you please provide the command you are running, the full traceback of this error, and the log file?

filaPro commented 2 years ago

Following this issue, I think your package versions do not match ours. Can you please try with our Dockerfile, which includes pytorch==1.8.1, MinkowskiEngine@v0.5.4 and CUDA 10.2?
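The versions listed above can be compared against a local environment with a small script. This is only a sketch: the helper names `installed_versions` and `mismatches` are hypothetical, and it assumes both packages expose `__version__` and that `torch.version.cuda` reports the CUDA version PyTorch was built against.

```python
import importlib

# Versions named in the comment above, used here as the reference setup.
EXPECTED = {"torch": "1.8.1", "MinkowskiEngine": "0.5.4", "cuda": "10.2"}


def installed_versions():
    """Collect installed versions; a package that fails to import is reported as None."""
    found = {}
    for name in ("torch", "MinkowskiEngine"):
        try:
            module = importlib.import_module(name)
            # Drop local build suffixes such as "+cu111".
            found[name] = getattr(module, "__version__", "unknown").split("+")[0]
        except Exception:  # any import failure counts as "not usable"
            found[name] = None
    try:
        import torch
        found["cuda"] = torch.version.cuda  # None for CPU-only builds
    except Exception:
        found["cuda"] = None
    return found


def mismatches(found, expected=EXPECTED):
    """Return {name: (found, expected)} for every version that differs."""
    return {k: (found.get(k), v) for k, v in expected.items() if found.get(k) != v}


if __name__ == "__main__":
    for name, (have, want) in mismatches(installed_versions()).items():
        print(f"{name}: found {have}, expected {want}")
```

An empty result from `mismatches` would mean the environment matches the three versions named above; it says nothing about other dependencies.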

BrotherHappy commented 2 years ago

Thanks for your quick reply. The traceback:

    Traceback (most recent call last):
      File "tools/train.py", line 223, in <module>
        main()
      File "tools/train.py", line 219, in main
        meta=meta)
      File "/study/fcaf3d/mmdet3d/apis/train.py", line 34, in train_model
        meta=meta)
      File "/home/brother/Desktop/od/mmdetection/mmdet/apis/train.py", line 170, in train_detector
        runner.run(data_loaders, cfg.workflow)
      File "/home/brother/anaconda3/envs/od/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
        epoch_runner(data_loaders[i], **kwargs)
      File "/home/brother/anaconda3/envs/od/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
        self.call_hook('after_train_iter')
      File "/home/brother/anaconda3/envs/od/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
        getattr(hook, fn_name)(self)
      File "/home/brother/anaconda3/envs/od/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 35, in after_train_iter
        runner.outputs['loss'].backward()
      File "/home/brother/anaconda3/envs/od/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/home/brother/anaconda3/envs/od/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
        allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

I did not use the Dockerfile. My setup: pytorch=1.9.0, MinkowskiEngine=0.5.4, cuda=11.1, gpu=rtx3060.
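CUDA reports errors asynchronously, so the frame that raises (here `backward()`) is not necessarily the call that performed the illegal access. A common way to get a more precise traceback is to rerun with `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous. A minimal sketch, assuming the script is started from the repository root; `blocking_env` and `rerun_training` are hypothetical helper names, not part of the repository.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Copy the environment and force synchronous CUDA kernel launches."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_training(config, dry_run=True):
    """Build (and optionally run) the single-GPU training command."""
    cmd = [sys.executable, "tools/train.py", config]
    if dry_run:
        return cmd  # only show what would be executed
    # The variable must be set before the child process initializes CUDA.
    return subprocess.run(cmd, env=blocking_env(), check=False).returncode


if __name__ == "__main__":
    print(rerun_training("configs/fcaf3d/fcaf3d_sunrgbd-3d-10class.py"))
```

Training runs noticeably slower in this mode, so it is meant for locating the failing call rather than for regular use.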

BrotherHappy commented 2 years ago

> Following this issue, I think your package versions do not match ours. Can you please try with our Dockerfile, which includes pytorch==1.8.1, MinkowskiEngine@v0.5.4 and CUDA 10.2?

I'll try again. Thanks!

joshiaLee commented 2 years ago

Hi @BrotherHappy, I used one GPU like you. When you train on your data, this command might help:

`MASTER_PORT=29500 MASTER_ADDR='localhost' WORLD_SIZE=1 RANK=0 bash tools/dist_train.sh configs/fcaf3d/fcaf3d_sunrgbd-3d-10class.py`

(the parameters are given manually). Also edit your train.sh so that the launch line reads `python3 $(dirname "$0")/train.py $CONFIG --launcher pytorch ${@:3}`, with the `-m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT` part commented out.
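The four variables in that command are the ones `torch.distributed` reads when it initializes from the environment (`env://`). The same setup can be sketched in Python; the helper names below are hypothetical, and the command mirrors the edited launch line rather than anything the repository ships.

```python
import shlex


def single_process_dist_env(port=29500, addr="localhost"):
    """Rendezvous variables for a one-process 'distributed' run."""
    return {
        "MASTER_PORT": str(port),
        "MASTER_ADDR": addr,
        "WORLD_SIZE": "1",  # a single process in total
        "RANK": "0",        # and this process is the first one
    }


def train_command(config, python="python3"):
    """Call train.py directly, without `-m torch.distributed.launch`."""
    return [python, "tools/train.py", config, "--launcher", "pytorch"]


def as_shell_line(env, cmd):
    """Render the environment and command as one copy-pasteable shell line."""
    assignments = [f"{key}={shlex.quote(value)}" for key, value in env.items()]
    return " ".join(assignments + [shlex.quote(part) for part in cmd])


if __name__ == "__main__":
    config = "configs/fcaf3d/fcaf3d_sunrgbd-3d-10class.py"
    print(as_shell_line(single_process_dist_env(), train_command(config)))
```

Passing the variables explicitly stands in for what the launcher would otherwise set for each process.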

filaPro commented 2 years ago

Btw, why not use tools/train.py instead of tools/dist_train.sh when using a single GPU?

joshiaLee commented 2 years ago

Um... I just wanted to follow the commands step by step. No special reason.

SamsungLabs / fcaf3d

I follow the step of file 'read me' except that i use only a single gpu and it errors: "RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered" #19
