Closed sdeven95 closed 2 years ago
This looks more like an NVIDIA hardware/software error. You should try it on a different GPU instance. Ensure that PyTorch is compatible with the installed CUDA libraries and that your GPUs support mixed-precision training.
I have updated torch and torchvision. Now the error is:
2022-08-16 11:13:51 - DEBUG - Training epoch 0 with 22096 samples
/home/sheng/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py:175: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [576, 144, 1, 1], strides() = [144, 1, 1, 1]
bucket_view.sizes() = [576, 144, 1, 1], strides() = [144, 1, 144, 144] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:326.)
allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
2022-08-16 11:14:06 - LOGS - Epoch: 0 [ 1/10000000], loss: 5.5593, LR: [0.1, 0.1], Avg. batch load time: 13.853, Elapsed time: 14.94
2022-08-16 11:14:07 - LOGS - Exception occurred that interrupted the training. CUDA error: an illegal memory access was encountered
With this limited information, it is hard to say anything, but it seems like the issue is with your setup. Please ensure that you are using the appropriate CUDA driver, cuDNN version, and a compatible PyTorch version.
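A quick way to check the setup is to print the versions PyTorch itself was built against and compare them with what nvidia-smi reports; a minimal sketch:

```python
import torch

# Versions PyTorch was compiled against -- compare these with the
# driver/toolkit versions shown by `nvidia-smi`.
print("PyTorch:", torch.__version__)           # e.g. 1.12.1+cu102
print("Built for CUDA:", torch.version.cuda)   # e.g. 10.2 (None on CPU-only builds)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU available:", torch.cuda.is_available())
```

If torch.version.cuda is newer than what the installed driver supports, CUDA calls can fail in exactly this kind of opaque way.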
This is the version information:
NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2
(base) :~$ cat /usr/local/cuda/version.txt
CUDA Version 10.2.89
(base) :~$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
pip install torch==1.12.1+cu102 torchvision==0.13.1+cu102 torchaudio==0.12.1 -f https://download.pytorch.org/whl/torch_stable.html
It appears that your CUDA driver is not compatible with the PyTorch and cuDNN versions.
Why? I checked the PyTorch website and found that 1.12.1 works with CUDA 10.2. I also found on the NVIDIA developer website that cuDNN 7.6.5 supports CUDA 10.2. Is anything wrong?
Sorry, I confused it with my driver version. I was using CUDA 11.3.
Could you turn off the channels_last flag and try again?
You are right. When I set
mixed_precision: true
channels_last: false
the error no longer appears. Why?
I am training models with two RTX 2080 Ti 11 GB cards.
It's because these GPUs don't support the channels-last format.
Thanks so much.
I also found that channels last is a beta feature of PyTorch?
(beta) Channels Last Memory Format in PyTorch, Author: Vitaly Fedyunin
There are still many things to do, such as: Resolving ambiguity of N1HW and NC11 Tensors; Testing of Distributed Training support; Improving operators coverage.
The article is still in the PyTorch tutorials. Is it out of date?
You probably want to ask about it in the PyTorch forums.
It looks like your issue is resolved, so closing it. Feel free to reopen if the issue is not resolved.
When I was training a MobileNetV3 model with mixed_precision = true, the program raised an error like this:
2022-08-16 03:13:22 - DEBUG - Training epoch 0 with 66072 samples
2022-08-16 03:14:03 - LOGS - Epoch: 0 [ 1/10000000], loss: 5.1851, LR: [0.1, 0.1], Avg. batch load time: 38.484, Elapsed time: 40.62
2022-08-16 03:14:06 - LOGS - Exception occurred that interrupted the training. CUDA error: an illegal memory access was encountered
Do you have any suggestions?
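For reference, a mixed_precision: true flag typically maps onto PyTorch's torch.cuda.amp machinery. A minimal sketch of one mixed-precision training step (the model and optimizer here are illustrative, not the actual trainer from this repository):

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# GradScaler rescales the loss so small fp16 gradients do not underflow;
# it is a no-op when disabled (e.g. on CPU).
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(4, 3, 16, 16, device=device)
target = torch.randn(4, 8, 16, 16, device=device)

optimizer.zero_grad()
# autocast runs eligible ops in half precision on the GPU.
with torch.cuda.amp.autocast(enabled=(device == "cuda")):
    loss = nn.functional.mse_loss(model(x), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

Errors inside such a step usually point at the op/layout combination rather than at autocast itself, which is consistent with channels_last being the trigger here.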