CUDA error terminates the training process

V2AI / Det3D

World's first general-purpose 3D object detection codebase.

https://arxiv.org/abs/1908.09492

Apache License 2.0

1.48k stars, 299 forks

CUDA error terminates the training process #71

Son-Goku-gpu closed this issue 4 years ago

Son-Goku-gpu commented 4 years ago

Sometimes my program is terminated by "CUDA error: an illegal memory access was encountered" during training. I used the official code and the default config, changing only data_root and work_dir. The bug occurred in both single-GPU training and distributed multi-GPU training. The picture below shows the error information: [image: training halts midway]

Sometimes training on a single GPU is also terminated as shown below: [image: single-GPU training problem]

In multi-GPU training, however, this problem seems to be ignored: [image: single-GPU problem ignored]

The environment of my server includes:

- OS: Ubuntu 16.04
- Python: 3.7.3
- CUDA: 10.1
- CUDNN: 7.4.1
- pytorch: 1.3.1
- gcc: 5.5.0
- cmake: 3.16.0
- nvidia driver version: 418.40.04
- gpu: 8 TITAN Xp

Really weird! How can I solve these problems, given that they occur so often? Could anyone provide some information on them? Thanks a lot!

tianweiy commented 4 years ago

Hi, I ran into this problem a month ago. I fixed it by reinstalling the required libraries in a new Conda env, following these exact versions:

OS: Ubuntu 16.04/18.04
Python: 3.6.5
PyTorch: 1.1
CUDA: 10.0
CUDNN: 7.5.0

Also, if you have multiple CUDA versions installed on your system, please make sure that the symlink /usr/local/cuda points to the right CUDA version (it should be CUDA 10.0) and that it matches your PyTorch's CUDA version.
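The check described above could be scripted roughly as follows. This is only a sketch: it assumes the toolkit under /usr/local/cuda ships a version.txt file (CUDA 10.x does; other releases may not), and the helper names major_minor, toolkit_version and versions_match are made up for illustration. A matching major.minor version does not by itself rule out other causes of the error.

```python
import os
import re


def major_minor(version):
    """Extract (major, minor) from strings like '10.1.243' or 'CUDA Version 10.1.105'."""
    match = re.search(r"(\d+)\.(\d+)", version or "")
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_version(cuda_home="/usr/local/cuda"):
    """Read version.txt from the directory the symlink resolves to (assumed layout)."""
    path = os.path.join(os.path.realpath(cuda_home), "version.txt")
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None  # no toolkit here, or a release that does not ship version.txt


def versions_match(torch_cuda, toolkit):
    """True only when both versions parse and share the same major.minor."""
    a, b = major_minor(torch_cuda), major_minor(toolkit)
    return a is not None and a == b


if __name__ == "__main__":
    try:
        import torch
        torch_cuda = torch.version.cuda  # CUDA version PyTorch was built with
    except ImportError:
        torch_cuda = None
    toolkit = toolkit_version()
    print("PyTorch built with CUDA:", torch_cuda)
    print("/usr/local/cuda reports:", toolkit)
    print("major.minor match:", versions_match(torch_cuda, toolkit))
```

Run it inside the Conda env used for training, so that the torch being inspected is the one the training job actually imports.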

Son-Goku-gpu commented 4 years ago

@tianweiy Hi, thanks for your solution. I checked my PyTorch's CUDA version with torch.version.cuda and got '10.1.243', and my /usr/local/cuda symlink points to "CUDA Version 10.1.105". The environment of my server includes:

- OS: Ubuntu 16.04
- Python: 3.7.3
- CUDA: 10.1
- CUDNN: 7.4.1
- pytorch: 1.3.1
- gcc: 5.5.0
- cmake: 3.16.0

So it does not seem to be a version mismatch between CUDA and the compiled PyTorch. Meanwhile, when I train the model on a single GPU, I hit the problem below: [image: single-GPU training problem] This problem also seems to be mentioned in https://github.com/traveller59/spconv/issues/66. Do you (@poodarchu @tianweiy) have any idea how to solve these?

tianweiy commented 4 years ago

Hmm. I only hit the first problem in the past. It showed a CUDA out-of-bounds/illegal memory address error when I was using CUDA 10.1 with PyTorch 1.3. I still guess the second problem is related, even though it comes from spconv. Do you want to try CUDA 10 and PyTorch 1.1 (and, for convenience, match the Python version too)? I don't have any of these problems with that setup.

tianweiy commented 4 years ago

Also make sure to recompile spconv if you change your torch / CUDA / etc.

Son-Goku-gpu commented 4 years ago

@tianweiy Thanks for your advice, I will check the environment further. Could anyone else provide more information on this? @s-ryosky @poodarchu

Tai-Wang commented 4 years ago

@tianweiy @Son-Goku-gpu Hi, I am using CUDA 10 and PyTorch 1.1 and sometimes encounter the same error. Although I use Python 3.7.5 instead of 3.6, I still don't think it's caused by the environment... Hoping for more information.

hz3014 commented 4 years ago

Hi, I am facing the same problem, and I highly doubt this error has a lot to do with apex/multi-GPU training. Did you install apex, and how many GPUs do you have and which model?

Son-Goku-gpu commented 4 years ago

@hz3014 Hi, I installed apex as described in INSTALLATION.md. I have 8 TITAN Xp GPUs in my server, and the bug can happen when I use 2, 4, or 8 GPUs.
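
For anyone else reporting this error, a small script along these lines might make it easier to compare setups (versions, whether apex is importable, GPU count and model). It is a rough sketch rather than part of Det3D; collect_env is a hypothetical helper name, and it only uses standard PyTorch attributes.

```python
import importlib.util
import platform


def collect_env():
    """Gather version and GPU details; torch fields stay None/empty if torch is missing."""
    info = {
        "python": platform.python_version(),
        "apex_installed": importlib.util.find_spec("apex") is not None,
        "pytorch": None,
        "cuda": None,
        "cudnn": None,
        "gpus": [],
    }
    try:
        import torch
    except ImportError:
        return info
    info["pytorch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # CUDA version PyTorch was built with
    info["cudnn"] = torch.backends.cudnn.version()
    if torch.cuda.is_available():
        # One entry per visible device, e.g. the GPU model name
        info["gpus"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

The gcc, cmake and driver versions listed earlier in the thread are not covered here and would still need to be checked by hand.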