apple / ml-cvnets

CVNets: A library for training computer vision networks
https://apple.github.io/ml-cvnets

AMP settings #43

Closed sdeven95 closed 2 years ago

sdeven95 commented 2 years ago

When I was training a MobileNetV3 model with mixed_precision = true, the program raised an error like this:

2022-08-16 03:13:22 - DEBUG - Training epoch 0 with 66072 samples
2022-08-16 03:14:03 - LOGS - Epoch: 0 [ 1/10000000], loss: 5.1851, LR: [0.1, 0.1], Avg. batch load time: 38.484, Elapsed time: 40.62
2022-08-16 03:14:06 - LOGS - Exception occurred that interrupted the training. CUDA error: an illegal memory access was encountered

Do you have any suggestions?

sacmehta commented 2 years ago

This looks more like an NVIDIA hardware/software error. You should try it on a different GPU instance. Ensure that PyTorch is compatible with the installed CUDA libraries and that your GPUs are compatible with mixed-precision training.

sdeven95 commented 2 years ago

I have updated torch and torchvision. Now the error is:

2022-08-16 11:13:51 - DEBUG - Training epoch 0 with 22096 samples
/home/sheng/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py:175: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [576, 144, 1, 1], strides() = [144, 1, 1, 1]
bucket_view.sizes() = [576, 144, 1, 1], strides() = [144, 1, 144, 144]
(Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:326.)
allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
(the warning above is printed a second time, verbatim)
2022-08-16 11:14:06 - LOGS - Epoch: 0 [ 1/10000000], loss: 5.5593, LR: [0.1, 0.1], Avg. batch load time: 13.853, Elapsed time: 14.94
2022-08-16 11:14:07 - LOGS - Exception occurred that interrupted the training. CUDA error: an illegal memory access was encountered

sacmehta commented 2 years ago

With this limited information, it is hard to say anything, but it seems like the issue is with your set-up. Please ensure that you are using appropriate CUDA drivers, the right cuDNN version, and a compatible PyTorch version.

sdeven95 commented 2 years ago

This is the version information:

NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2

(base) :~$ cat /usr/local/cuda/version.txt
CUDA Version 10.2.89

(base) :~$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5

pip install torch==1.12.1+cu102 torchvision==0.13.1+cu102 torchaudio==0.12.1 -f https://download.pytorch.org/whl/torch_stable.html
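For reference, the `+cu102` suffix in the wheel version above is the CUDA toolkit the wheel was built against, and it should match the locally installed CUDA version. A quick way to check that programmatically, using only the standard library (`cuda_tag` is a hypothetical helper written for illustration, not part of PyTorch):

```python
def cuda_tag(version_string):
    """Extract the CUDA build tag (e.g. 'cu102') from a PyTorch wheel
    version string such as '1.12.1+cu102'.

    Returns None for CPU-only or untagged builds.
    """
    # The local version label follows the '+' separator (PEP 440).
    _, _, local = version_string.partition("+")
    return local if local.startswith("cu") else None

print(cuda_tag("1.12.1+cu102"))  # cu102
print(cuda_tag("1.12.1"))        # None
```

On a live install you could compare `cuda_tag(torch.__version__)` against `torch.version.cuda` or the toolkit version reported by `cat /usr/local/cuda/version.txt`.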

sacmehta commented 2 years ago

It appears that the CUDA driver is not compatible with your PyTorch and cuDNN versions.

sdeven95 commented 2 years ago

Why? I checked the PyTorch website and found that 1.12.1 works with CUDA 10.2. I also found on the NVIDIA developer website that cuDNN 7.6.5 supports CUDA 10.2. Is anything wrong?

sacmehta commented 2 years ago

Sorry, I confused it with my driver version. I was using CUDA 11.3.

Could you turn off the channels_last flag and try again?

sdeven95 commented 2 years ago

You are right. When I set

mixed_precision: true
channels_last: false

it's OK; the error no longer appears. Why?

sdeven95 commented 2 years ago

I am training models with two RTX 2080 Ti 11 GB cards.

sacmehta commented 2 years ago

It's because these GPUs don't support the channels-last memory format.
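The stride values in the earlier DDP warning show exactly what channels_last changes: the tensor keeps its NCHW shape, but the elements are laid out in memory with the channel dimension varying fastest. A small pure-Python sketch of the two stride rules (illustrative only, not PyTorch code):

```python
def contiguous_strides(n, c, h, w):
    # Default NCHW layout: the last dimension (W) varies fastest.
    return (c * h * w, h * w, w, 1)

def channels_last_strides(n, c, h, w):
    # NHWC memory layout, still indexed as NCHW: C varies fastest.
    return (h * w * c, 1, w * c, c)

shape = (576, 144, 1, 1)  # the weight shape from the DDP warning
print(contiguous_strides(*shape))     # (144, 1, 1, 1)
print(channels_last_strides(*shape))  # (144, 1, 144, 144)
```

For the [576, 144, 1, 1] weight from the log, these give exactly the grad strides (144, 1, 1, 1) and bucket_view strides (144, 1, 144, 144) that the warning reported, i.e. the gradient and DDP's bucket disagreed about the memory layout.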

sdeven95 commented 2 years ago

Thanks so much.

I also found that channels last is a beta feature of PyTorch:

(beta) Channels Last Memory Format in PyTorch, Author: Vitaly Fedyunin

There are still many things to do, such as: Resolving ambiguity of N1HW and NC11 Tensors; Testing of Distributed Training support; Improving operators coverage.

The article is still in the PyTorch tutorials. Is it out of date?

sacmehta commented 2 years ago

You probably want to ask about it in PyTorch forums.

It looks like your issue is resolved, so I am closing it. Feel free to reopen if it is not.