Open huangyjhust opened 5 years ago
Hi @huangyjhust,
could you post a (small) reproducible code snippet so that we could have a look?
Hi @ptrblck, here is some toy code for model parallelism:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from apex.fp16_utils import *
from apex import amp, optimizers
from apex.multi_tensor_apply import multi_tensor_applier
from torch import optim

GPUs=['cuda:0','cuda:1']
Mode='fp16'

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.A=nn.Conv2d(1, 64, 1)
        self.B=nn.Conv2d(64, 1, 1)

    def forward(self, x):
        X1=self.A(x).to(GPUs[1])
        Out=self.B(X1)
        return Out

def Loss(y_pred,y_true):
    overlap=torch.sum(y_pred.float()*y_true)
    bottom=torch.sum(y_pred.float()+y_true)
    Loss=1-2*(overlap+1e-4)/(bottom+1e-4)
    return Loss

Network=Model()
Network.A=Network.A.to(GPUs[0])
Network.B=Network.B.to(GPUs[1])
lr=1e-4
Optimizer=optim.Adam(list(Network.A.parameters())+list(Network.B.parameters()),lr=lr,amsgrad=True)
if Mode=='fp16':
    Network, Optimizer = amp.initialize(Network, Optimizer,opt_level='O2',loss_scale='dynamic')
Network.train()
X=torch.zeros([1,1,128,128]).to(GPUs[0])
Y=torch.ones([1,1,128,128]).to(GPUs[1])
Pred=Network(X)
LossAll=Loss(Pred,Y)
if Mode=='fp16':
    with amp.scale_loss(LossAll, Optimizer) as scaled_loss:
        scaled_loss.backward()
else:
    LossAll.backward()
Optimizer.step()
print('OK')
When Mode is set to 'fp16', the following error is reported:
Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
Defaults for this optimization level are:
opt_level : O2
enabled : True
cast_model_type : torch.float16
master_weights : True
patch_torch_functions : False
loss_scale : dynamic
keep_batchnorm_fp32 : True
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
opt_level : O2
enabled : True
cast_model_type : torch.float16
master_weights : True
patch_torch_functions : False
loss_scale : dynamic
keep_batchnorm_fp32 : True
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-2-0244a3664e1b> in <module>()
41 if Mode=='fp16':
42 with amp.scale_loss(LossAll, Optimizer) as scaled_loss:
---> 43 scaled_loss.backward()
44 else:
45 LossAll.backward()
/home/hp/anaconda3/lib/python3.5/contextlib.py in __exit__(self, type, value, traceback)
64 if type is None:
65 try:
---> 66 next(self.gen)
67 except StopIteration:
68 return False
/home/hp/anaconda3/lib/python3.5/site-packages/apex/amp/handle.py in scale_loss(loss, optimizers, loss_id, model, delay_unscale, delay_overflow_check)
129 # For future fused optimizers that enable sync-free dynamic loss scaling,
130 # should_skip will always be False.
--> 131 should_skip = False if delay_overflow_check else loss_scaler.update_scale()
132 if should_skip:
133 for optimizer in optimizers:
/home/hp/anaconda3/lib/python3.5/site-packages/apex/amp/scaler.py in update_scale(self)
191 # If the fused kernel is available, we only need one D2H memcopy and sync.
192 if LossScaler.has_fused_kernel and self.dynamic and not self._has_overflow:
--> 193 self._has_overflow = self._overflow_buf.item()
194
195 if self._has_overflow and self.dynamic:
RuntimeError: CUDA error: an illegal memory access was encountered
When Mode is set to 'fp32', the code runs fine.
Thanks!
Same error on my side, any ideas?
Sorry for the late reply. I'll take a look at the code now.
@huangyjhust @psinger
I executed the code snippet on one of our machines (8x V100) and tested all four opt_levels.
None threw an error.
Example output for O2:
Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
Defaults for this optimization level are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
OK
Could you check the code snippet again and see if something is missing?
I'm using PyTorch 1.2.0a0+f6aac41 and apex master.
Any chance it's due to the current stable, easily installable version of PyTorch (1.1.0)?
Sorry for the late reply. For me the error suddenly disappeared after restarting the machine.
I ran into the same issue when using Apex with model parallelism (GPipe). Sometimes there are no errors but the loss becomes NaN; other times it throws "illegal memory access".
More specifically, I used FusedLayerNorm from Apex. When I moved the layer containing FusedLayerNorm to other GPUs (not the first one), I got an error along the lines of "Tensor not in the same device". In my environment, a log message is printed whenever CUDA is initialized on a GPU. I then tested by moving the model to GPUs 1-7: when I ran the model, GPU 0 got initialized just before the FusedLayerNorm. I didn't have time to dig into it then, so I switched to PyTorch's LayerNorm, and it worked fine.
Do you guys have any clues yet? I can write some toy code when I have time.
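The observation above (GPU 0 being initialized although the model lives on GPUs 1-7) would be consistent with some operation running on the current CUDA device instead of the device that holds the input. That is only a guess and is not confirmed anywhere in this thread. A minimal sketch of guarding a call so that the input's device is made current first is shown below; the helper names `device_guard` and `run_on_own_device` are hypothetical, and the example uses a CPU tensor so it also runs on machines without GPUs.

```python
import contextlib

import torch


def device_guard(tensor):
    """Return a context manager that makes `tensor`'s GPU the current device.

    For CPU tensors this is a no-op context. (Hypothetical helper.)
    """
    if tensor.is_cuda:
        return torch.cuda.device(tensor.device)
    return contextlib.nullcontext()


def run_on_own_device(fn, tensor):
    """Call fn(tensor) while the current CUDA device is tensor's device."""
    with device_guard(tensor):
        return fn(tensor)


# In the setup described above this would be e.g. device='cuda:1';
# a CPU tensor is used here so the sketch runs anywhere.
x = torch.ones(2, 3)
y = run_on_own_device(lambda t: t * 2, x)
```

With a tensor on 'cuda:1', anything inside the guarded call that relies on the current device would then target GPU 1 rather than GPU 0.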
Hi @842974287,
a code snippet to reproduce this issue would be really helpful, as we are currently not able to reproduce it.
Hi, @ptrblck ,
I found that, for the snippet that @huangyjhust provided, if you simply change GPUs=['cuda:0','cuda:1'] to GPUs=['cuda:0','cuda:2'], the illegal access error appears again.
The reason I do this is that I want to put net A on GPUs [0,1] and net B on GPUs [2,3] with data parallel to further accelerate it.
Do you have any idea about this error? Thanks.
@HobbitLong Thanks for the additional information.
I've changed the GPUs to:
GPUs=['cuda:0', 'cuda:2']
and executed the code snippet with opt_level='O1' and opt_level='O2'. However, neither run throws this error, and both print the final 'OK' statement. Did you make any other code changes to the script?
Hi, @ptrblck ,
Thank you for trying.
I have tested multiple times (copy-paste, changing only this single line), but I always run into the same error. I even reinstalled everything 5 minutes ago (pytorch 1.2.0 and apex master), but still got the same error. Here is the error message:
Traceback (most recent call last):
  File "tmp.py", line 48, in <module>
    scaled_loss.backward()
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/data/vision/phillipi/rep-learn/torch_3.5/lib/python3.5/site-packages/apex/amp/handle.py", line 124, in scale_loss
    should_skip = False if delay_overflow_check else loss_scaler.update_scale()
  File "/data/vision/phillipi/rep-learn/torch_3.5/lib/python3.5/site-packages/apex/amp/scaler.py", line 200, in update_scale
    self._has_overflow = self._overflow_buf.item()
RuntimeError: CUDA error: an illegal memory access was encountered
By the way, this is the environment I am using:
>>> import torch
>>> torch.__version__
'1.2.0'
>>> torch.version.cuda
'10.0.130'
>>> torch.backends.cudnn.version()
7602
>>>
Random guess: any chance this only affects machines with 4 GPUs, or that in your case, where 8 GPUs are present, the error only appears with cuda:4?
Thanks for the update and the idea to use another GPU id! I could indeed reproduce this issue using different GPUs!
Notes to self for debugging:
Hi, @ptrblck,
Thanks for working on this!
Another thing I noticed might be related to this (but I am not sure).
Suppose we have two nets, net_A and net_B, and 4 GPUs.
If I put both net_A and net_B on all 4 GPUs with DataParallel, then amp works well.
However, if I put net_A on GPUs 0,2 and net_B on GPUs 1,3, then amp seems to break, in the sense that the loss scale quickly drops from a large value to a small value and the loss becomes NaN. I tried this option just for faster training.
I hope this adds some useful information about this bug as well.
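For readers unfamiliar with the symptom described above, here is a rough, pure-Python sketch of how dynamic loss scaling generally behaves: the scale is reduced whenever non-finite gradients are seen (and the step is skipped), and it is raised again after a run of clean steps. This is not apex's implementation; the class name `ToyDynamicLossScaler` and all constants are assumptions chosen for illustration.

```python
import math


class ToyDynamicLossScaler:
    """Hypothetical illustration of dynamic loss scaling; not apex's LossScaler."""

    def __init__(self, init_scale=2.0 ** 16, factor=2.0,
                 growth_interval=2000, min_scale=1.0):
        self.scale = init_scale
        self.factor = factor
        self.growth_interval = growth_interval
        self.min_scale = min_scale
        self._good_steps = 0

    def update(self, grads):
        """Update the scale from this step's gradients.

        Returns True if the optimizer step should be skipped.
        """
        overflow = any(not math.isfinite(g) for g in grads)
        if overflow:
            # Shrink the scale (never below min_scale) and skip the step.
            self.scale = max(self.scale / self.factor, self.min_scale)
            self._good_steps = 0
            return True
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # Enough clean steps in a row: try a larger scale again.
            self.scale *= self.factor
            self._good_steps = 0
        return False


scaler = ToyDynamicLossScaler()
# Simulate an overflow being reported on every single step.
skipped = [scaler.update([float("inf")]) for _ in range(20)]
final_scale = scaler.scale
```

In this toy model, a scale that collapses quickly means an overflow is being detected (rightly or wrongly) on nearly every step, which is one way to read the behaviour reported above.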
Hi @ptrblck, I get the same error with the pix2pixHD code on a 2080 Ti, while on a Titan Xp everything is OK. I also copied huangyjhust's toy code, and the same error appears. My guess is that the 2080 Ti does not support P2P.
The extra error message is:
THCudaCheck FAIL file=/pytorch/aten/src/ATen/native/cuda/Normalization.cuh line=628 error=77 : an illegal memory access was encountered
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCStorage.cpp line=49 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train.py", line 85, in <module>
    with amp.scale_loss(loss_G, optimizer_G) as scaled_loss: scaled_loss.backward()
  File "/home/zhou/pix2pixHD/venv/lib/python3.6/site-packages/torch/tensor.py", line 120, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/zhou/pix2pixHD/venv/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/ATen/native/cuda/Normalization.cuh:628
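The P2P guess above can be checked directly on a given machine. The sketch below assumes a recent PyTorch release that provides `torch.cuda.can_device_access_peer` (the versions mentioned in this thread may predate it); the helper name `peer_access_matrix` is hypothetical. On a machine without GPUs it simply returns an empty mapping.

```python
import torch


def peer_access_matrix():
    """Map (i, j) -> bool for every ordered pair of distinct visible GPUs."""
    n = torch.cuda.device_count()
    return {
        (i, j): torch.cuda.can_device_access_peer(i, j)
        for i in range(n)
        for j in range(n)
        if i != j
    }


matrix = peer_access_matrix()
for (i, j), ok in sorted(matrix.items()):
    state = "available" if ok else "unavailable"
    print(f"cuda:{i} -> cuda:{j}: peer access {state}")
```

Whether missing peer access is actually related to the illegal memory access is not established in this thread; the check only helps compare machines that fail with machines that do not.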
Possibly related to https://github.com/NVIDIA/apex/issues/319, where I found something strange going on with torch device as well.
@ptrblck @huangyjhust I have tried the two-GPU model-parallel example above and had no problem. But the following 4-GPU model still has the problem.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from apex.fp16_utils import *
from apex import amp, optimizers
from apex.multi_tensor_apply import multi_tensor_applier
from torch import optim

GPUs=['cuda:0','cuda:1', 'cuda:2', 'cuda:3']
Mode='fp16'

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.A=nn.Conv2d(1, 64, 1)
        self.A.to(GPUs[0])
        self.B=nn.Conv2d(64, 64, 1)
        self.B.to(GPUs[1])
        self.C=nn.Conv2d(64, 64, 1)
        self.C.to(GPUs[2])
        self.D=nn.Conv2d(64, 1, 1)
        self.D.to(GPUs[3])

    def forward(self, x):
        X1=self.A(x ).to(GPUs[1])
        X2=self.B(X1).to(GPUs[2])
        X3=self.C(X2).to(GPUs[3])
        Out = self.D(X3)
        return Out

def Loss(y_pred,y_true):
    overlap=torch.sum(y_pred.float()*y_true)
    bottom=torch.sum(y_pred.float()+y_true)
    Loss=1-2*(overlap+1e-4)/(bottom+1e-4)
    return Loss

Network=Model()
#Network.A=Network.A.to(GPUs[0])
#Network.B=Network.B.to(GPUs[1])
lr=1e-4
Optimizer=optim.Adam(list(Network.A.parameters())+list(Network.B.parameters())+list(Network.C.parameters())+list(Network.D.parameters()),lr=lr,amsgrad=True)
if Mode=='fp16':
    Network, Optimizer = amp.initialize(Network, Optimizer,opt_level='O1') #,loss_scale='dynamic')
Network.train()
X=torch.zeros([512,1,128,128]).to(GPUs[0])
Y=torch.ones([512,1,128,128]).to(GPUs[3])
Pred=Network(X)
LossAll=Loss(Pred,Y)
if Mode=='fp16':
    with amp.scale_loss(LossAll, Optimizer) as scaled_loss:
        scaled_loss.backward()
else:
    LossAll.backward()
Optimizer.step()
print('OK')
Hi, @ptrblck,
Just want to check: is there a possibility that I can easily fix this problem myself, if you guys are busy with other important problems? If so, do you have any hints about this problem? Thanks!
I present a fix here: https://github.com/AlibabaPAI/apex/tree/fix_cuda_mem_bug @HobbitLong @ptrblck. The cause of this problem is multi-GPU access on one GPU stream.
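To illustrate the kind of approach that the stated cause suggests, the sketch below groups tensors by device and inspects each group separately, with that device made current, instead of handling tensors from several GPUs in one pass. This is not the code from the linked branch, whose contents are not shown in this thread; the function names `group_by_device` and `any_non_finite` are hypothetical, and CPU tensors are used so the sketch runs without GPUs.

```python
import contextlib
from collections import defaultdict

import torch


def group_by_device(tensors):
    """Bucket tensors by the device they live on. (Hypothetical helper.)"""
    groups = defaultdict(list)
    for t in tensors:
        groups[t.device].append(t)
    return dict(groups)


def any_non_finite(tensors):
    """Check each device's tensors separately, with that device made current."""
    for device, group in group_by_device(tensors).items():
        if device.type == "cuda":
            ctx = torch.cuda.device(device)
        else:
            ctx = contextlib.nullcontext()
        with ctx:
            for t in group:
                if not torch.isfinite(t).all().item():
                    return True
    return False


# Stand-ins for gradients; in a model-parallel run these would sit on
# different GPUs (e.g. 'cuda:0' and 'cuda:2').
grads = [torch.ones(3), torch.tensor([1.0, float("inf")])]
has_overflow = any_non_finite(grads)
```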
Can you tell me how to use your code? I installed your branch but the problem still exists.
Hi, I also encountered this problem, how can I fix it?
I checked, and I can run the 4-GPU demo above without any problem. I don't know what is causing your problem; could you share the piece of code that still fails? @Aria-K-Alethia
Could you help me merge my changes? @ptrblck I am not authorized to push to this repository.
Here is my code:
import torch
import torch.nn as nn
from apex import amp

torch.cuda.set_device(torch.device('cuda:0'))

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv1d(1, 1, 3, padding=1).to('cuda:1')
        self.conv2 = nn.Conv1d(1, 1, 3, padding=1).to('cuda:0')
        self.linear = nn.Linear(3, 1).to('cuda:0')

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x.to('cuda:0'))
        x = x.view(x.shape[0], -1)
        x = self.linear(x)
        return x

net = Net()
optimizer = torch.optim.SGD(net.parameters(), 0.1)
net, optimizer = amp.initialize(net, optimizer, opt_level='O1')
x = torch.rand((64,1,3)).to('cuda:1')
criterion = nn.BCEWithLogitsLoss()
out = net(x)
target = torch.randint(0, 2, (64,)).float().to('cuda:0')
loss = criterion(out.squeeze(), target)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
I installed your branch with the command pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
I ran this code with the command python temp.py
By the way, my python, torch and cuda versions are 3.7, 1.4 and 10.1 respectively.
I tested your code and there is no problem.
GPU: V100 16G
python_version: 3.5.4
torch_version: 1.3.0
cuda_version: 10.0
torch.version.cuda: '10.1.243'
cudnn version: 7.4.2
outputs:
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
master_weights : None
loss_scale : dynamic
cast_model_type : None
patch_torch_functions : True
enabled : True
keep_batchnorm_fp32 : None
opt_level : O1
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
master_weights : None
loss_scale : dynamic
cast_model_type : None
patch_torch_functions : True
enabled : True
keep_batchnorm_fp32 : None
opt_level : O1
@chengmengli06 I don't know, but the problem still exists. Anyway, thank you for your help.
I've encountered the same problem, even though I'm training on just one GPU. It works fine with opt_level='O0' but not with any of the other levels.
@teresabucho you can test my branch, which should be ok.
@chengmengli06 Yes, seems to work. Thanks
@ptrblck could my fix be accepted? https://github.com/NVIDIA/apex/pull/689
I also encounter the issue. Has there been any progress?
I'm running my model on really large 3D volumes, so I have to define a model-parallel network like this:
class model(....):
    def forward(self, x):
        # x is on 'cuda:0'
It runs well using float32, but I still want a larger volume size or more channels, so I tried apex and it reported:
RuntimeError: CUDA error: an illegal memory access was encountered
from scaler.py:
self._has_overflow = self._overflow_buf.item()
Any ideas?