Hi @jbohnslav, are you somehow manipulating the batchnorm parameters manually? Based on your code I cannot see any difference to vanilla amp training.
Best, ptrblck
Thanks for your quick reply! I'm not messing with any batchnorm parameters. In the error messages, you can see the bug occurs in the feature_extractor module, which I define as follows: self.feature_extractor = SiameseTower(inplanes=inplanes). That module is defined here:
class SiameseTower(nn.Module):
    def __init__(self, inplanes=3, planes=32, blocks=3):
        super(SiameseTower, self).__init__()
        self.preprocessor = nn.Sequential(
            nn.Conv2d(inplanes, planes, kernel_size=3, stride=1, padding=1, bias=True),
            BasicBlock(planes, planes),
            BasicBlock(planes, planes),
            BasicBlock(planes, planes),
        )
        block_list = []
        for block in range(blocks):
            # block_list.append(BasicBlock(planes, planes, stride=2))
            block_list.append(conv_bn_relu_downsample(planes))
        self.residual_blocks = nn.Sequential(*block_list)
        self.final = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        x = self.preprocessor(x)
        x = self.residual_blocks(x)
        x = self.final(x)
        return x
And the BasicBlock module is very slightly changed from the official ResNet examples:
class BasicBlock(torch.nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, dilation=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride, dilation=dilation)
        self.bn1 = nn.BatchNorm2d(planes)
        # self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes, stride, dilation=dilation)
        self.bn2 = nn.BatchNorm2d(planes)
        if stride > 1 and downsample is None:
            downsample = nn.Conv2d(inplanes, planes, kernel_size=3, stride=stride, padding=1, dilation=1)
        self.downsample = downsample
        self.stride = stride
        self.act = torch.nn.LeakyReLU(negative_slope=0.2)

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.act(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            residual = self.downsample(x)
        # print(out.shape, residual.shape)
        out += residual
        out = self.act(out)
        return out
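For completeness: conv3x3 and conv_bn_relu_downsample are not shown in the snippets above. conv3x3 presumably follows the standard torchvision ResNet helper, and conv_bn_relu_downsample, judging only by its name and its stride-2 role in SiameseTower, might look roughly like this sketch (not the author's actual code):

import torch.nn as nn

def conv3x3(in_planes, out_planes, stride=1, dilation=1):
    """3x3 convolution with padding, as in the torchvision ResNet helper."""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=dilation, dilation=dilation, bias=False)

def conv_bn_relu_downsample(planes):
    """Hypothetical stride-2 conv + BN + LeakyReLU block, inferred from its name
    and its use in SiameseTower.residual_blocks; an assumption, not the original."""
    return nn.Sequential(
        nn.Conv2d(planes, planes, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(planes),
        nn.LeakyReLU(negative_slope=0.2),
    )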
As you can see, there's no messing around with batchnorm parameters.
Could you post a small executable code snippet? I tried to reproduce this issue using your code, and wasn't sure which hyperparameters you are using. This script seems to be working. Could you check it and check for differences to your code?
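For readers without access to the linked script, a rough stand-in driver might look like the sketch below. It is not the linked script itself; it simply reuses the SiameseTower/BasicBlock definitions posted above (assumed to be in scope) with a plain amp O1 setup and a single training step.

import torch
import torch.nn as nn
from apex import amp

# Rough repro driver (a sketch, not the script linked above); assumes the
# SiameseTower / BasicBlock definitions posted earlier are already defined.
class Wrapper(nn.Module):
    def __init__(self):
        super(Wrapper, self).__init__()
        self.feature_extractor = SiameseTower(inplanes=3)

    def forward(self, x):
        return self.feature_extractor(x)

model = Wrapper().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

x = torch.randn(2, 3, 256, 256, device='cuda')
output = model(x)                  # the backtraces below point at a BatchNorm2d here
loss = output.mean()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()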
In your script samples above, which line is producing the backtrace you posted? Is it coming from bn1/bn2 in BasicBlock, or the conv_bn_relu_downsample(planes) in residual_blocks? It appears you're using conv_bn_relu_downsample as an alternative to BasicBlock in residual_blocks.
> Could you post a small executable code snippet? I tried to reproduce this issue using your code, and wasn't sure which hyperparameters you are using. This script seems to be working. Could you check it and check for differences to your code?
Hey @ptrblck , Thanks for coming up with an executable version of my code. I ran your code example exactly, and got the same error message. Here it is:
python testing_apex.py
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Traceback (most recent call last):
File "testing_apex.py", line 306, in <module>
output = model(x)
File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "testing_apex.py", line 295, in forward
x = self.feature_extractor(x)
File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "testing_apex.py", line 245, in forward
x = self.preprocessor(x)
File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "testing_apex.py", line 273, in forward
out = self.bn1(out)
File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 83, in forward
exponential_average_factor, self.eps)
File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/functional.py", line 1697, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: expected scalar type Half but found Float
@mcarilli, good catch, I didn't notice I had replaced BasicBlock with conv_bn_relu_downsample. It's a moot point, as @ptrblck's code sample fixed this. Furthermore, in the error message I posted above, you can see it's coming from bn1, here:
File "/home/jim/Documents/python/stereo_model_zoo/models/activestereonet.py", line 97, in forward
x = self.preprocessor(x)
In the SiameseTower module I posted, the preprocessor is just a Sequential with a conv2d followed by 3 BasicBlocks.
I installed current Apex master in both pytorch/pytorch:nightly-devel-cuda10.0-cudnn7 and pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-devel from pytorch dockerhub, ran Piotr's minimal script, and did not observe any errors.
Based on the release notes, Pytorch 1.1's commit hash is 142c973.
git checkout + git log indicate that your commit 95ce796 is dated Mon Apr 29 22:20:50 2019. The 1.1 commit 142c973 is dated Tue Apr 30 19:22:19 2019. Those are pretty darn close and I find it hard to believe that something between them broke Amp. The only substantial difference in my mind is python 3.7 in your environment vs python 3.6.8 in the "official" containers. However, I also made a container with pytorch 1.1 and python 3.7.3, and I ran Piotr's script once more in this container without any errors.
@mcarilli I'm running Python 3.7 on my machines and it's also working. If it helps, I could try to build the particular PyTorch commit and check again, as I've used some PyTorch nightly (~ a week old) for the test script.
@jbohnslav @mcarilli I could reproduce this error by disabling cuDNN (using torch.backends.cudnn.enabled = False).
@jbohnslav Could this be the issue? Are you disabling cuDNN (accidentally)?
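To make that concrete, here is a minimal sketch of the failure mode being described, as it behaved on the PyTorch 1.1-era builds in this thread (it will not necessarily reproduce on newer releases): with cuDNN disabled, batch norm falls back to the native kernel, which rejected a half-precision input against fp32 affine parameters and running stats.

import torch
import torch.nn as nn

# Force the non-cuDNN batch norm path, then feed a half input to a BatchNorm2d
# whose weight/bias/running stats are still fp32 (as amp O1 keeps them).
torch.backends.cudnn.enabled = False

bn = nn.BatchNorm2d(32).cuda()                        # fp32 parameters/stats
x = torch.randn(4, 32, 64, 64, device='cuda').half()  # fp16 activations

out = bn(x)  # on the builds discussed here: "RuntimeError: expected scalar type Half but found Float"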
Hello all, thanks so much for your help on this! I was not setting torch.backends.cudnn.enabled = False. However, it seems as though there was a problem with my cuDNN installation. In trying to verify the installation, I couldn't find cudnn.h. According to this comment on another issue, the cause was the fact that I installed cuDNN with a .deb file. I reinstalled from a tar file, then uninstalled and reinstalled PyTorch and Apex. This solved my issue.
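As an aside, a quick way to check whether the installed PyTorch build can actually see cuDNN (rather than hunting for cudnn.h on disk) is to query it directly:

import torch

print(torch.backends.cudnn.enabled)         # user-controlled flag, defaults to True
print(torch.backends.cudnn.is_available())  # False would explain the native-kernel fallback
print(torch.backends.cudnn.version())       # e.g. 7501 for cuDNN 7.5.1, or None if missing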
Edit: with cudnn and amp properly working, my model size is a tiny fraction of what it was previously. On an RTX GPU, it runs ~twice as fast. Thanks again!
Great catch @ptrblck, this is a helpful gotcha to be aware of in the future.
@jbohnslav if you'll indulge us, I am curious about that 2X speedup. Do you mean that (amp O1 or O2, cudnn enabled) is 2X faster than (amp off or O0, cudnn disabled/unavailable)? If so, I am wondering how much of the 2X comes from enabling amp mixed precision and how much comes from enabling cudnn. Could you try running an epoch with (amp.initialize(..., enabled=False), cudnn enabled) so we can get the comparison with (amp O1 or O2, cudnn enabled)?
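In code, the requested baseline is simply a run where amp is switched off via its documented enabled flag, so that only mixed precision (not cuDNN) differs between the two measurements. A minimal sketch, with a placeholder model standing in for the real network:

import torch
import torch.nn as nn
from apex import amp

# Placeholder model/optimizer; substitute the actual network being benchmarked.
model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Baseline: amp disabled entirely (pure fp32), cuDNN left at its default (enabled).
model, optimizer = amp.initialize(model, optimizer, opt_level='O1', enabled=False)

# Comparison run: same script, but with amp actually on.
# model, optimizer = amp.initialize(model, optimizer, opt_level='O1')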
@mcarilli, sure, I ran a few quick tests. I have a rather complex loss function with many components, so I'm listing both training speed and inference speed. Training includes the forward pass, loss computation, backward pass, and optimizer step; inference includes only the forward pass. Note that with opt_level='O3', keep_batchnorm_fp32=True, the losses became nan.
Experiment | Train speed (fps) | Inference speed (fps)
---|---|---
fp32, cudnn=False | 13.19 | 25.09
fp32, cudnn=True | 22.81 | 50.87
fp16 (opt_level 'O1'), cudnn=True | 10.30 | 63.91
fp16 (opt_level 'O2'), cudnn=True | 10.14 | 64.32
fp16 (opt_level 'O3', keep_batchnorm_fp32=True), cudnn=True | 23.96 | 65.02
I'm using a Titan RTX GPU. It seems like most of the speedup was just fixing my cudnn. There's a ~20% increase in speed with half precision during inference, but a ~50% slowdown during training. This isn't the ultimate speedup that I'd hoped for with Tensor Cores. This is outside the scope of this issue, however.
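For reference, throughput numbers like those in the table can be measured with a small harness along these lines (a sketch, not the author's actual benchmark; torch.cuda.synchronize keeps asynchronous kernel launches from fooling the timer):

import time
import torch

def measure_throughput(run_step, n_iters=100, warmup=10):
    """run_step() performs one training or inference step on a fixed batch.
    Returns iterations per second; multiply by the batch size for frames/sec."""
    for _ in range(warmup):
        run_step()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        run_step()
    torch.cuda.synchronize()
    return n_iters / (time.time() - start)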
Thanks for all your great work. I've found that trying to keep batch norm in fp32 results in a RuntimeError. Here is the minimum example:
Here is the error message:
If I use opt_level='O1', I get the error. If I use opt_level='O3', keep_batchnorm_fp32=True, I get the error. If I use opt_level='O3', keep_batchnorm_fp32=False, everything works fine (except that training results in nan losses, which is apparently to be expected from 'pure' fp16 training).
Information about system:
python --version = 3.7.3
nvcc --version release 10.0, V10.0.130
torch.__version__ = '1.1.0a0+95ce796'
Amp downloaded and installed today, May 13: commit = 4ff153cd50e4533b21dc1fd97c0ed609e19c4042
Thanks for your help!
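For readers skimming this thread, the three configurations compared in the original report correspond roughly to the following amp.initialize calls (a sketch with a placeholder model; the calls are alternatives, not sequential steps):

import torch
import torch.nn as nn
from apex import amp

# Placeholder model/optimizer; substitute the real network from the report.
model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# 1) O1 (patched functions, batchnorm kept in fp32): raised the RuntimeError above.
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

# 2) O3 with fp32 batchnorm: also raised the RuntimeError.
# model, optimizer = amp.initialize(model, optimizer, opt_level='O3',
#                                   keep_batchnorm_fp32=True)

# 3) Pure fp16: ran, but training produced nan losses, as noted in the report.
# model, optimizer = amp.initialize(model, optimizer, opt_level='O3',
#                                   keep_batchnorm_fp32=False)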