NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

Model parallel: an illegal memory access was encountered #371

Open huangyjhust opened 5 years ago

huangyjhust commented 5 years ago

I'm running my model on really large 3D volumes, so I have to split it across GPUs roughly like this (x starts on 'cuda:0'):

class Model(...):
    def forward(self, x):
        # x is on 'cuda:0'
        XA = self.A(x)
        XB = self.B(XA.to('cuda:1'))
        return XB

It runs well in float32, but I want a larger volume size or more channels, so I tried apex and it reported:

RuntimeError: CUDA error: an illegal memory access was encountered

raised from scaler.py at self._has_overflow = self._overflow_buf.item(). Any ideas?

ptrblck commented 5 years ago

Hi @huangyjhust,

could you post a (small) reproducible code snippet so that we could have a look?

huangyjhust commented 5 years ago

Hi @ptrblck, a toy example of the model-parallel setup is written as follows:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from apex.fp16_utils import *
from apex import amp, optimizers
from apex.multi_tensor_apply import multi_tensor_applier
from torch import optim

GPUs=['cuda:0','cuda:1']
Mode='fp16'
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.A=nn.Conv2d(1, 64, 1)
        self.B=nn.Conv2d(64, 1, 1)   
    def forward(self, x):
        # run A on GPU 0, then move the activation to GPU 1 for B
        X1=self.A(x).to(GPUs[1])
        Out=self.B(X1)
        return Out

def Loss(y_pred,y_true):
    # soft Dice loss
    overlap=torch.sum(y_pred.float()*y_true)
    bottom=torch.sum(y_pred.float()+y_true)
    Loss=1-2*(overlap+1e-4)/(bottom+1e-4)
    return Loss

Network=Model()
# place the two conv layers on different GPUs
Network.A=Network.A.to(GPUs[0])
Network.B=Network.B.to(GPUs[1])
lr=1e-4
Optimizer=optim.Adam(list(Network.A.parameters())+list(Network.B.parameters()),lr=lr,amsgrad=True)
if Mode=='fp16':
    Network, Optimizer = amp.initialize(Network, Optimizer,opt_level='O2',loss_scale='dynamic')

Network.train()
X=torch.zeros([1,1,128,128]).to(GPUs[0])
Y=torch.ones([1,1,128,128]).to(GPUs[1])
Pred=Network(X)
LossAll=Loss(Pred,Y)
if Mode=='fp16':
    with amp.scale_loss(LossAll, Optimizer) as scaled_loss:
        scaled_loss.backward()
else:
    LossAll.backward()
Optimizer.step()
print('OK')

When Mode is set to 'fp16', the following error is reported:

Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
opt_level              : O2
enabled                : True
cast_model_type        : torch.float16
master_weights         : True
patch_torch_functions  : False
loss_scale             : dynamic
keep_batchnorm_fp32    : True
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
opt_level              : O2
enabled                : True
cast_model_type        : torch.float16
master_weights         : True
patch_torch_functions  : False
loss_scale             : dynamic
keep_batchnorm_fp32    : True
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-2-0244a3664e1b> in <module>()
     41 if Mode=='fp16':
     42     with amp.scale_loss(LossAll, Optimizer) as scaled_loss:
---> 43         scaled_loss.backward()
     44 else:
     45     LossAll.backward()

/home/hp/anaconda3/lib/python3.5/contextlib.py in __exit__(self, type, value, traceback)
     64         if type is None:
     65             try:
---> 66                 next(self.gen)
     67             except StopIteration:
     68                 return False

/home/hp/anaconda3/lib/python3.5/site-packages/apex/amp/handle.py in scale_loss(loss, optimizers, loss_id, model, delay_unscale, delay_overflow_check)
    129             # For future fused optimizers that enable sync-free dynamic loss scaling,
    130             # should_skip will always be False.
--> 131             should_skip = False if delay_overflow_check else loss_scaler.update_scale()
    132             if should_skip:
    133                 for optimizer in optimizers:

/home/hp/anaconda3/lib/python3.5/site-packages/apex/amp/scaler.py in update_scale(self)
    191         # If the fused kernel is available, we only need one D2H memcopy and sync.
    192         if LossScaler.has_fused_kernel and self.dynamic and not self._has_overflow:
--> 193             self._has_overflow = self._overflow_buf.item()
    194 
    195         if self._has_overflow and self.dynamic:

RuntimeError: CUDA error: an illegal memory access was encountered

When Mode is set to 'fp32', the code runs well.

Thanks!

psinger commented 5 years ago

Same error on my side, any ideas?

ptrblck commented 5 years ago

Sorry for the late reply. I'll take a look at the code now.

ptrblck commented 5 years ago

@huangyjhust @psinger I executed the code snippet on one of our machines (8x V100) and tested all four opt_levels. None threw an error. Example output for O2:

Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
OK

Could you check the code snippet again and see if something is missing?

I'm using PyTorch 1.2.0a0+f6aac41 and apex master.

dmenig commented 5 years ago

Any chance it's due to the current standard, easily installable version of PyTorch (1.1.0)?

psinger commented 5 years ago

Sorry for the late reply; for me the error suddenly disappeared after restarting the machine.

842974287 commented 5 years ago

I ran into the same issue when using apex with model parallelism (GPipe). Sometimes there are no errors but the loss becomes NaN, and sometimes it throws "illegal memory access".

More specifically, I used FusedLayerNorm from apex. When I moved the layer that contains FusedLayerNorm to a GPU other than the first one, it gave me an error saying something like "tensors are not on the same device". In my environment, a log message is printed whenever CUDA is initialized on a GPU. I tested this by moving the model to GPUs 1-7; when I ran it, GPU 0 got initialized before the FusedLayerNorm was even reached. At the time I didn't have time to dig into it, so I switched to PyTorch's LayerNorm, and it worked fine (see the sketch below).

Do you guys have any clues yet? I can write some toy code when I have time.
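For reference, a minimal sketch of the workaround (not the actual GPipe model; the hidden size and devices are placeholders, and it assumes at least two GPUs). The commented-out lines are the apex layer that triggered the error for me; plain nn.LayerNorm works on the non-default GPU:

import torch
import torch.nn as nn
# from apex.normalization import FusedLayerNorm  # the apex layer that hit the device error

hidden = 512                                      # placeholder hidden size
# layer = FusedLayerNorm(hidden).to('cuda:1')     # raised "tensors are not on the same device"
layer = nn.LayerNorm(hidden).to('cuda:1')         # plain PyTorch LayerNorm on the second GPU

x = torch.randn(8, hidden, device='cuda:1')
out = layer(x)
print(out.device)  # cuda:1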

ptrblck commented 5 years ago

Hi @842974287,

a code snippet to reproduce this issue would be really helpful, as we are currently not able to reproduce it.

HobbitLong commented 5 years ago

Hi, @ptrblck ,

I found that, for the snippet that @huangyjhust provided, if you simply change GPUs=['cuda:0','cuda:1'] to GPUs=['cuda:0','cuda:2'], the illegal access error appears again.

The reason I tried this is that I want to put net A on GPUs [0,1] and net B on GPUs [2,3] with data parallel to further accelerate training.

Do you have any idea about this error? Thanks.

ptrblck commented 5 years ago

@HobbitLong Thanks for the additional information. I've changed the GPUs to:

GPUs=['cuda:0', 'cuda:2']

and executed the code snippet with opt_level='O1' and opt_level='O2'. However, neither run throws this error; both print the final 'OK' statement. Did you make any other code changes to the script?

HobbitLong commented 5 years ago

Hi, @ptrblck ,

Thank you for your try.

I have tested multiple times (copy-pasting and changing only this single line), but I always run into the same error. I even reinstalled everything five minutes ago (PyTorch 1.2.0 and apex master) and still got the same error. Here is the error message:

Traceback (most recent call last):
  File "tmp.py", line 48, in <module>
    scaled_loss.backward()
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/data/vision/phillipi/rep-learn/torch_3.5/lib/python3.5/site-packages/apex/amp/handle.py", line 124, in scale_loss
    should_skip = False if delay_overflow_check else loss_scaler.update_scale()
  File "/data/vision/phillipi/rep-learn/torch_3.5/lib/python3.5/site-packages/apex/amp/scaler.py", line 200, in update_scale
    self._has_overflow = self._overflow_buf.item()
RuntimeError: CUDA error: an illegal memory access was encountered

By the way, the environment I am using is like:

>>> import torch
>>> torch.__version__
'1.2.0'
>>> torch.version.cuda
'10.0.130'
>>> torch.backends.cudnn.version()
7602
>>> 

Random guess: any chance this is only related to machines with 4 GPUs, or that in your case, where 8 GPUs are present, the error only appears with cuda:4?

ptrblck commented 5 years ago

Thanks for the update and the idea to use another GPU id! I could indeed reproduce this issue using different GPUs!

Notes to self for debugging:

HobbitLong commented 5 years ago

Hi, @ptrblck,

Thanks for working on this!

Another thing I noticed might be related to this (but I am not sure).

Let us suppose we have two nets, net_A and net_B, and 4 GPUs. If I put both net_A and net_B on all 4 GPUs with DataParallel, then amp works well. However, if I put net_A on GPUs 0,2 and net_B on GPUs 1,3, then amp seems to break down: the loss scale quickly drops from a large value to a small one and the loss becomes NaN. I tried this layout just for faster training; a minimal sketch of it is below.
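A minimal sketch of that layout (the conv layers and batch size are placeholders, it assumes 4 GPUs, and amp is left out since the device split itself is what seems to matter):

import torch
import torch.nn as nn

# net_A replicated across GPUs 0 and 2, net_B across GPUs 1 and 3
net_A = nn.DataParallel(nn.Conv2d(1, 64, 1).to('cuda:0'), device_ids=[0, 2])
net_B = nn.DataParallel(nn.Conv2d(64, 1, 1).to('cuda:1'), device_ids=[1, 3])

x = torch.zeros(8, 1, 128, 128, device='cuda:0')
y = net_B(net_A(x).to('cuda:1'))   # gather on cuda:0, hop to cuda:1, scatter again
print(y.shape, y.device)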

Hope this can add some information to this bug as well.

sunsheng commented 5 years ago

Hi @ptrblck, I get the same error with the pix2pixHD code on a 2080 Ti, while on a Titan Xp everything is OK. I also copied @huangyjhust's toy code and the same error appears. My guess is that the 2080 Ti does not support P2P.

extra error message is:

THCudaCheck FAIL file=/pytorch/aten/src/ATen/native/cuda/Normalization.cuh line=628 error=77 : an illegal memory access was encountered
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCStorage.cpp line=49 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train.py", line 85, in <module>
    with amp.scale_loss(loss_G, optimizer_G) as scaled_loss:
      scaled_loss.backward()
  File "/home/zhou/pix2pixHD/venv/lib/python3.6/site-packages/torch/tensor.py", line 120, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/zhou/pix2pixHD/venv/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/ATen/native/cuda/Normalization.cuh:628
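If it helps, here is a quick way to check whether peer-to-peer access is available between GPU pairs (assuming your PyTorch build exposes torch.cuda.can_device_access_peer; newer versions do):

import torch

# Print whether each GPU can directly access every other GPU's memory (P2P).
# False means cross-device copies have to go through host memory.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            print(i, '->', j, torch.cuda.can_device_access_peer(i, j))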

BramVanroy commented 5 years ago

Possibly related to https://github.com/NVIDIA/apex/issues/319, where I found something strange going on with the torch device as well.

chengmengli06 commented 4 years ago

@ptrblck @huangyjhust I have tried the two-GPU model-parallel example above and had no problem, but the following 4-GPU model still shows the issue.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from apex.fp16_utils import *
from apex import amp, optimizers
from apex.multi_tensor_apply import multi_tensor_applier
from torch import optim

GPUs=['cuda:0','cuda:1', 'cuda:2', 'cuda:3']
Mode='fp16'
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.A=nn.Conv2d(1, 64, 1)
        self.A.to(GPUs[0])
        self.B=nn.Conv2d(64, 64, 1)   
        self.B.to(GPUs[1])
        self.C=nn.Conv2d(64, 64, 1)   
        self.C.to(GPUs[2])
        self.D=nn.Conv2d(64, 1, 1)   
        self.D.to(GPUs[3])
    def forward(self, x):
        # hop the activation to the next GPU after each conv
        X1=self.A(x).to(GPUs[1])
        X2=self.B(X1).to(GPUs[2])
        X3=self.C(X2).to(GPUs[3])
        Out=self.D(X3)
        return Out

def Loss(y_pred,y_true):
    overlap=torch.sum(y_pred.float()*y_true)
    bottom=torch.sum(y_pred.float()+y_true)
    Loss=1-2*(overlap+1e-4)/(bottom+1e-4)
    return Loss

Network=Model()
#Network.A=Network.A.to(GPUs[0])
#Network.B=Network.B.to(GPUs[1])
lr=1e-4
Optimizer=optim.Adam(list(Network.A.parameters())+list(Network.B.parameters())+list(Network.C.parameters())+list(Network.D.parameters()),lr=lr,amsgrad=True)
if Mode=='fp16':
    Network, Optimizer = amp.initialize(Network, Optimizer,opt_level='O1') #,loss_scale='dynamic')

Network.train()
X=torch.zeros([512,1,128,128]).to(GPUs[0])
Y=torch.ones([512,1,128,128]).to(GPUs[3])
Pred=Network(X)
LossAll=Loss(Pred,Y)
if Mode=='fp16':
    with amp.scale_loss(LossAll, Optimizer) as scaled_loss:
        scaled_loss.backward()
else:
    LossAll.backward()
Optimizer.step()
print('OK')

HobbitLong commented 4 years ago

Hi, @ptrblck,

Just want to check: is there any chance I could fix this problem myself, if you are busy with other, more important problems? If so, do you have any hints about it? Thanks!

chengmengli06 commented 4 years ago

I have a fix here: https://github.com/AlibabaPAI/apex/tree/fix_cuda_mem_bug @HobbitLong @ptrblck. The root cause of this problem is that tensors on multiple GPUs are accessed from a single GPU's stream; a simplified illustration is below.
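To illustrate the pattern only (this is a simplified sketch, not the actual patch in my branch; it assumes at least two GPUs and uses a stand-in buffer name): if a buffer lives on a non-default GPU, switching to that buffer's device before touching it keeps the work on the stream of the GPU that owns the memory.

import torch

# Stand-in for a buffer that lives on a non-default GPU (e.g. amp's overflow buffer).
overflow_buf = torch.zeros(1, dtype=torch.int, device='cuda:1')

# Switch the current device to the buffer's device so the read/sync happens on the
# stream of the GPU that owns the memory, instead of the default device's stream.
with torch.cuda.device(overflow_buf.device):
    has_overflow = overflow_buf.item()
print(has_overflow)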

Aria-K-Alethia commented 4 years ago

> I have a fix here: https://github.com/AlibabaPAI/apex/tree/fix_cuda_mem_bug @HobbitLong @ptrblck. The root cause of this problem is that tensors on multiple GPUs are accessed from a single GPU's stream.

Can you tell me how to use your code? I installed your branch but the problem still exists.

Aria-K-Alethia commented 4 years ago

Hi, I also encountered this problem. How can I fix it?

chengmengli06 commented 4 years ago

I checked that I can run the 4-GPU demo above without any problem. I don't know what causes your problem; could you show the piece of code that still fails? @Aria-K-Alethia

Could you help me merge my changes, @ptrblck? I am not authorized to push to this repository.

Aria-K-Alethia commented 4 years ago

> I checked that I can run the 4-GPU demo above without any problem. I don't know what causes your problem; could you show the piece of code that still fails? @Aria-K-Alethia
>
> Could you help me merge my changes, @ptrblck? I am not authorized to push to this repository.

Here is my code:

import torch
import torch.nn as nn
from apex import amp

torch.cuda.set_device(torch.device('cuda:0'))

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv1d(1, 1, 3, padding=1).to('cuda:1')  # first conv on the second GPU
        self.conv2 = nn.Conv1d(1, 1, 3, padding=1).to('cuda:0')  # rest of the model on the default GPU
        self.linear = nn.Linear(3, 1).to('cuda:0')

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x.to('cuda:0'))  # move the activation back to cuda:0
        x = x.view(x.shape[0], -1)
        x = self.linear(x)
        return x

net = Net()
optimizer = torch.optim.SGD(net.parameters(), 0.1)
net, optimizer = amp.initialize(net, optimizer, opt_level='O1')
x = torch.rand((64,1,3)).to('cuda:1')
criterion = nn.BCEWithLogitsLoss()
out = net(x)
target = torch.randint(0, 2, (64,)).float().to('cuda:0')
loss = criterion(out.squeeze(), target)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

I installed your branch with:

pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

and ran the code with python temp.py. By the way, my Python, torch, and CUDA versions are 3.7, 1.4, and 10.1 respectively.

chengmengli06 commented 4 years ago

I tested your code; there is no problem.

GPU: V100 16G
python_version: 3.5.4
torch_version: 1.3.0
cuda_version: 10.0
torch.version.cuda: '10.1.243'
cudnn version: 7.4.2

outputs:

Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
master_weights         : None
loss_scale             : dynamic
cast_model_type        : None
patch_torch_functions  : True
enabled                : True
keep_batchnorm_fp32    : None
opt_level              : O1
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
master_weights         : None
loss_scale             : dynamic
cast_model_type        : None
patch_torch_functions  : True
enabled                : True
keep_batchnorm_fp32    : None
opt_level              : O1

Aria-K-Alethia commented 4 years ago

@chengmengli06 I don't know, but the problem still exists. Anyway, thank you for your help.

annukkaa commented 4 years ago

I've encountered the same problem, even though I'm training on just one GPU. It works fine with opt_level='O0' but not with any of the other levels.

chengmengli06 commented 4 years ago

@teresabucho you can test my branch, which should be OK.

annukkaa commented 4 years ago

@chengmengli06 Yes, seems to work. Thanks

chengmengli06 commented 4 years ago

@ptrblck could my fix be accepted? https://github.com/NVIDIA/apex/pull/689

somedadaism commented 4 years ago

I'm also encountering this issue. Has there been any progress?