Can you please provide the command you are running, the full traceback of this error, and the log file?
Following this issue, I think your package versions do not match ours. Can you please try with our Dockerfile, which includes pytorch==1.8.1, MinkowskiEngine@v0.5.4, and CUDA 10.2?
Thanks for your quick reply. The traceback:
Traceback (most recent call last):
  File "tools/train.py", line 223, in <module>
    main()
  File "tools/train.py", line 219, in main
    meta=meta)
  File "/study/fcaf3d/mmdet3d/apis/train.py", line 34, in train_model
    meta=meta)
  File "/home/brother/Desktop/od/mmdetection/mmdet/apis/train.py", line 170, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/brother/anaconda3/envs/od/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/brother/anaconda3/envs/od/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/home/brother/anaconda3/envs/od/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/brother/anaconda3/envs/od/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 35, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/brother/anaconda3/envs/od/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/brother/anaconda3/envs/od/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
I did not use the Dockerfile. My setup is pytorch=1.9.0, MinkowskiEngine=0.5.4, cuda=11.1, and gpu=RTX 3060.
I'll try again. Thanks!
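For anyone comparing their own environment against the versions recommended in this thread, a small check along these lines might help. It is only a sketch: the `EXPECTED` values are copied from the Dockerfile recommendation above, the helper names `collect_versions` and `find_mismatches` are made up for illustration, and it assumes `MinkowskiEngine` exposes `__version__` (if it does not, the value is reported as `None`).

```python
import importlib

# Versions from the Dockerfile recommendation in this thread.
EXPECTED = {"torch": "1.8.1", "MinkowskiEngine": "0.5.4", "cuda": "10.2"}


def collect_versions():
    """Return installed versions; a package that cannot be imported maps to None."""
    found = {}
    for name in ("torch", "MinkowskiEngine"):
        try:
            module = importlib.import_module(name)
            found[name] = getattr(module, "__version__", None)
        except ImportError:
            found[name] = None
    try:
        import torch
        # CUDA version that this PyTorch build was compiled against.
        found["cuda"] = torch.version.cuda
    except ImportError:
        found["cuda"] = None
    return found


def find_mismatches(expected, found):
    """List (name, expected, found) for every entry that does not match."""
    mismatches = []
    for name, want in expected.items():
        have = found.get(name)
        # startswith tolerates build suffixes such as "1.8.1+cu102".
        if have is None or not str(have).startswith(want):
            mismatches.append((name, want, have))
    return mismatches


if __name__ == "__main__":
    for name, want, have in find_mismatches(EXPECTED, collect_versions()):
        print(f"{name}: expected {want}, found {have}")
```

With the setup reported above (pytorch 1.9.0, CUDA 11.1), this would flag `torch` and `cuda` but not `MinkowskiEngine`.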
Hi @BrotherHappy,
I used one GPU like you.
When you train on your data, this command might help you:
MASTER_PORT=29500 MASTER_ADDR='localhost' WORLD_SIZE=1 RANK=0 bash tools/dist_train.sh configs/fcaf3d/fcaf3d_sunrgbd-3d-10class.py
(give the parameters manually)
and comment out part of your train.sh so that it runs something like
python3 $(dirname "$0")/train.py $CONFIG --launcher pytorch ${@:3}
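The four variables in the command above are the ones `torch.distributed` reads when it initializes from the environment, so setting them by hand describes a "distributed" job with exactly one process. A rough Python equivalent, as a sketch only: the helper names `single_process_env` and `train_command` are hypothetical, the values are the ones from the command above, and the actual launch is left commented out.

```python
import os


def single_process_env(port=29500, addr="localhost"):
    """Environment for a one-process run, using the values from the command above."""
    env = dict(os.environ)
    env.update({
        "MASTER_ADDR": addr,       # address of the rank-0 process
        "MASTER_PORT": str(port),  # free port on that host
        "WORLD_SIZE": "1",         # total number of processes
        "RANK": "0",               # index of this process
    })
    return env


def train_command(config):
    # Mirrors the edited train.sh line; extra arguments are omitted here.
    return ["python3", "tools/train.py", config, "--launcher", "pytorch"]


if __name__ == "__main__":
    env = single_process_env()
    cmd = train_command("configs/fcaf3d/fcaf3d_sunrgbd-3d-10class.py")
    print(" ".join(cmd))
    # To launch from the repository root, uncomment:
    # import subprocess
    # subprocess.run(cmd, env=env, check=True)
```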
Btw, why not use tools/train.py instead of tools/dist_train.sh when using a single GPU?
Um, I just wanted to follow the commands step by step. No special reason.
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered