Closed Son-Goku-gpu closed 4 years ago
Hi, I got this problem one month ago; I fixed it by reinstalling the required libraries in a new Conda env, following the exact versions like
Also, if you have multiple CUDA versions installed on your system, please check that the symlink /usr/local/cuda points to the right CUDA version (it should be CUDA 10.0) and matches your PyTorch's CUDA version.
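To make the check above concrete: `torch.version.cuda` gives PyTorch's CUDA version, and `readlink /usr/local/cuda` (or `nvcc --version`) gives the toolkit the symlink points to. A minimal sketch of what "match" means here; the helper names are my own, and the assumption is that only major.minor needs to agree, with patch levels free to differ:

```python
import re

def cuda_major_minor(version: str) -> tuple:
    """Extract (major, minor) from a CUDA version string like '10.1.243'."""
    m = re.match(r"(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError(f"unrecognized CUDA version string: {version!r}")
    return int(m.group(1)), int(m.group(2))

def versions_match(torch_cuda: str, toolkit_cuda: str) -> bool:
    """PyTorch and the toolkit should agree on major.minor;
    patch levels ('.243' vs '.105') are allowed to differ."""
    return cuda_major_minor(torch_cuda) == cuda_major_minor(toolkit_cuda)

# The two strings reported later in this thread:
print(versions_match("10.1.243", "10.1.105"))  # → True
print(versions_match("10.1.243", "10.0.130"))  # → False
```

By this reading, the '10.1.243' / '10.1.105' pair reported below is consistent, which is why the discussion then moves on to other suspects.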
@tianweiy Hi, thanks for your solution. I checked my PyTorch's CUDA version via torch.version.cuda and got '10.1.243', and my /usr/local/cuda symlink points to "CUDA Version 10.1.105". The environment of my server includes:
- OS: Ubuntu 16.04
- Python: 3.7.3
- CUDA: 10.1
- CUDNN: 7.4.1
- pytorch: 1.3.1
- gcc: 5.5.0
- cmake: 3.16.0
So it doesn't seem to be a version mismatch between CUDA and the compiled PyTorch. At the same time, when I train the model with a single GPU, I run into the problem below. This problem also seems to be mentioned in https://github.com/traveller59/spconv/issues/66 — do you @poodarchu @tianweiy have any idea how to solve them?
Uhm, I only ran into the first problem in the past: it showed a CUDA out-of-bounds / illegal memory address error when I was using CUDA 10.1 with PyTorch 1.3. I still suspect the second problem is related, even though it comes from spconv. Could you try CUDA 10.0 and PyTorch 1.1 (and, for convenience, match the Python version as well)? I don't have any of these problems with those versions.
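The downgrade suggested here could be set up as a fresh Conda env. The env name and exact pins below are my own assumptions based on this comment (CUDA 10.0, PyTorch 1.1, an older Python), not the repo's official instructions:

```shell
# Fresh env pinned to the combination suggested above
# (pins are assumptions from this comment, not INSTALLATION.md).
conda create -n centerpoint-cu100 python=3.6 -y
conda activate centerpoint-cu100
conda install pytorch=1.1.0 cudatoolkit=10.0 -c pytorch -y
```

After switching envs, spconv (and any other CUDA extension) has to be rebuilt against the new toolchain.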
Also make sure to recompile spconv whenever you change your torch / CUDA / etc.
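A sketch of that rebuild, assuming an spconv v1.x source checkout (the `SPCONV_DIR` variable and function name are my own; spconv v1.x is built as a wheel via `setup.py bdist_wheel`):

```shell
# Sketch: rebuild spconv from source after a torch/CUDA change.
# SPCONV_DIR is an assumption -- point it at your spconv checkout.
rebuild_spconv() {
  dir=${SPCONV_DIR:-$HOME/spconv}
  if [ ! -d "$dir" ]; then
    echo "spconv source not found at $dir"
    return 1
  fi
  cd "$dir" || return 1
  rm -rf build dist            # drop artifacts compiled against the old CUDA
  python setup.py bdist_wheel  # spconv v1.x builds a wheel
  pip install --force-reinstall dist/*.whl
}
```

Clearing `build/` and `dist/` first matters: stale object files compiled against the previous CUDA version can otherwise be silently reused.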
@tianweiy Thanks for your advice, I will check the environment further. Could anyone else provide more information on this? @s-ryosky @poodarchu
@tianweiy @Son-Goku-gpu Hi, I am using CUDA 10.0 and PyTorch 1.1 and still encounter the same error sometimes. Although I use Python 3.7.5 instead of 3.6, I don't think it's caused by the environment... Hoping for more information.
Hi, I am facing the same problem, and I strongly suspect this error has a lot to do with apex / multi-GPU training. Did you install apex, and how many GPUs do you have, and which model?
@hz3014 Hi, I installed apex as described in INSTALLATION.md. I have 8 TITAN Xp GPUs in my server, and the bug can happen whether I use 2, 4, or 8 GPUs.
Sometimes my program is terminated by "CUDA error: an illegal memory access was encountered" during training. I used the official code and the default config, only changing data_root and work_dir; the bug occurred during training both with a single GPU and with distributed multi-GPU. The picture below shows the error information:
Sometimes the training on a single GPU is also terminated as below:
While this problem seems ignorable in multi-GPU training:
The environment of my server includes:
Really weird! How can I solve these problems, as they occur frequently? Could anyone provide some information on them? Thanks a lot!
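One general way to localize this kind of failure: CUDA kernel launches are asynchronous, so an illegal memory access is often reported at an unrelated later line. Forcing synchronous launches with the standard `CUDA_LAUNCH_BLOCKING` environment variable makes the stack trace point at the actual faulting op. A minimal sketch:

```python
import os

# Must be set before the first CUDA context is created -- in practice,
# before `import torch` runs anywhere in the process -- otherwise it
# has no effect on the already-initialized context.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print(os.environ["CUDA_LAUNCH_BLOCKING"])  # → 1
```

Equivalently, launch the training script as `CUDA_LAUNCH_BLOCKING=1 python train.py ...` from the shell. Training will be noticeably slower, so this is a debugging setting, not something to keep enabled.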