NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

BatchNorm RuntimeError: expected scalar type Half but found Float #301

Closed: jbohnslav closed this issue 5 years ago

jbohnslav commented 5 years ago

Thanks for all your great work. I've found that trying to keep batch norm in fp32 results in a RuntimeError. Here is a minimal example:


import torch
from torch import optim
from apex import amp

# CustomLoss, MyModel, and the dataloaders stand in for my own code
device = torch.device("cuda:%d"%(0) if torch.cuda.is_available() else "cpu")

criterion = CustomLoss(device=device)
model = MyModel().to(device)
dataloaders = ...
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=args.lr, eps=1e-8)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
dataiter = iter(dataloaders['train'])
images = next(dataiter)
images = images.to(device)
outputs = model(images)
loss = criterion(images, outputs)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

Here is the error message:

python testing_apex.py 
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
torch.Size([2, 1, 320, 640]) torch.Size([2, 1, 320, 640])
Traceback (most recent call last):
  File "testing_apex.py", line 205, in <module>
    disparities,invalidations = model(left, right)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/Documents/python/stereo_model_zoo/models/activestereonet.py", line 355, in forward
    left_feats = self.feature_extractor(left)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/Documents/python/stereo_model_zoo/models/activestereonet.py", line 97, in forward
    x = self.preprocessor(x)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/Documents/python/stereo_model_zoo/models/activestereonet.py", line 33, in forward
    out = self.bn1(out)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 83, in forward
    exponential_average_factor, self.eps)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: expected scalar type Half but found Float

If I use opt_level='O1', I get the error. If I use opt_level='O3' with keep_batchnorm_fp32=True, I also get the error. If I use opt_level='O3' with keep_batchnorm_fp32=False, everything runs fine, except that training results in nan losses, which is apparently to be expected from 'pure' fp16 training.
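For reference, the three configurations differ only in the amp.initialize call; roughly, with model and optimizer set up as in the snippet above:

# O1: patch torch functions and tensor methods -> RuntimeError
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

# O3 ("pure" fp16) but with batch norm kept in fp32 -> same RuntimeError
model, optimizer = amp.initialize(model, optimizer, opt_level='O3', keep_batchnorm_fp32=True)

# O3 with batch norm in fp16 as well -> runs, but training gives nan losses
model, optimizer = amp.initialize(model, optimizer, opt_level='O3', keep_batchnorm_fp32=False)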

System information:
- python --version: 3.7.3
- nvcc --version: release 10.0, V10.0.130
- torch.__version__: '1.1.0a0+95ce796'
- apex: downloaded and installed today, May 13 (commit 4ff153cd50e4533b21dc1fd97c0ed609e19c4042)

Thanks for your help!

ptrblck commented 5 years ago

Hi @jbohnslav, are you manipulating the batchnorm parameters manually somewhere? Based on your code, I cannot see any difference from vanilla amp training.

Best, ptrblck

jbohnslav commented 5 years ago

Thanks for your quick reply! I'm not messing with any batchnorm parameters. In the error messages, you can see the bug occurs in the feature_extractor module. I define that as follows: self.feature_extractor = SiameseTower(inplanes=inplanes). That module is defined here:

class SiameseTower(nn.Module):
    def __init__(self, inplanes=3, planes=32, blocks=3):
        super(SiameseTower, self).__init__()

        self.preprocessor = nn.Sequential(
            nn.Conv2d(inplanes, planes, kernel_size=3, stride=1, padding=1, bias=True),
            BasicBlock(planes, planes),
            BasicBlock(planes, planes),
            BasicBlock(planes, planes),
        )
        block_list = []
        for block in range(blocks):
            # block_list.append(BasicBlock(planes, planes, stride=2))
            block_list.append(conv_bn_relu_downsample(planes))
        self.residual_blocks = nn.Sequential(*block_list)

        self.final = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        x = self.preprocessor(x)
        x = self.residual_blocks(x)
        x = self.final(x)
        return x

And the BasicBlock module is very slightly changed from the official ResNet examples:

class BasicBlock(torch.nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, dilation=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride, dilation=dilation)
        self.bn1 = nn.BatchNorm2d(planes)

        # self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes, stride, dilation=dilation)
        self.bn2 = nn.BatchNorm2d(planes)
        if stride > 1 and downsample is None:
            downsample = nn.Conv2d(inplanes, planes, kernel_size=3, stride=stride, padding=1, dilation=1)
        self.downsample = downsample
        self.stride = stride
        self.act = torch.nn.LeakyReLU(negative_slope=0.2)

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.act(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        # print(out.shape, residual.shape)
        out += residual
        out = self.act(out)

        return out

As you can see, there's no messing around with batchnorm parameters.
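In case it's useful, one way to double-check what amp did to the batchnorm layers is to print their dtypes after amp.initialize (just a diagnostic sketch; under O1 the parameters themselves should stay in fp32, since the casts happen inside the patched functions):

import torch.nn as nn

for name, module in model.named_modules():
    if isinstance(module, nn.BatchNorm2d):
        # weight/bias are the affine parameters; running_mean/running_var are buffers
        print(name, module.weight.dtype, module.running_mean.dtype)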

ptrblck commented 5 years ago

Could you post a small executable code snippet? I tried to reproduce this issue using your code, but wasn't sure which hyperparameters you are using. This script seems to be working for me. Could you check it and look for differences from your code?

mcarilli commented 5 years ago

In your script samples above, which line is producing the backtrace you posted? Is it coming from bn1/bn2 in BasicBlock, or the conv_bn_relu_downsample(planes) in residual_blocks? It appears you're using conv_bn_relu_downsample as an alternative to BasicBlock in residual_blocks.

jbohnslav commented 5 years ago

> Could you post a small executable code snippet? I tried to reproduce this issue using your code, but wasn't sure which hyperparameters you are using. This script seems to be working for me. Could you check it and look for differences from your code?

Hey @ptrblck, thanks for putting together an executable version of my code. I ran your code example exactly and got the same error message. Here it is:

python testing_apex.py 
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Traceback (most recent call last):
  File "testing_apex.py", line 306, in <module>
    output = model(x)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "testing_apex.py", line 295, in forward
    x = self.feature_extractor(x)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "testing_apex.py", line 245, in forward
    x = self.preprocessor(x)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "testing_apex.py", line 273, in forward
    out = self.bn1(out)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 83, in forward
    exponential_average_factor, self.eps)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: expected scalar type Half but found Float

@mcarilli, good catch: I didn't notice I had replaced BasicBlock with conv_bn_relu_downsample. It's a moot point, though, since @ptrblck's code sample accounts for this. Furthermore, in the error message I posted above, you can see it's coming from bn1 here:

File "/home/jim/Documents/python/stereo_model_zoo/models/activestereonet.py", line 97, in forward
    x = self.preprocessor(x)

In the SiameseTower module I posted, the preprocessor is just a Sequential with a conv2d followed by 3 BasicBlocks.

mcarilli commented 5 years ago

I installed current Apex master in both pytorch/pytorch:nightly-devel-cuda10.0-cudnn7 and pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-devel from pytorch dockerhub, ran Piotr's minimal script, and did not observe any errors.

Based on the release notes, PyTorch 1.1's commit hash is 142c973.

git checkout + git log indicate that your commit 95ce796 is dated Mon Apr 29 22:20:50 2019, while the 1.1 commit 142c973 is dated Tue Apr 30 19:22:19 2019. Those are pretty darn close, and I find it hard to believe that something between them broke Amp. The only substantial difference I can see is Python 3.7 in your environment vs. Python 3.6.8 in the "official" containers. However, I also built a container with PyTorch 1.1 and Python 3.7.3 and ran Piotr's script once more in it without any errors.

ptrblck commented 5 years ago

@mcarilli I'm running Python 3.7 on my machines and it's also working. If it helps, I could try to build the particular PyTorch commit and check again, as I've used some PyTorch nightly (~ a week old) for the test script.

ptrblck commented 5 years ago

@jbohnslav @mcarilli I could reproduce this error by disabling cuDNN (using torch.backends.cudnn.enabled = False).

@jbohnslav Could this be the issue? Are you disabling cuDNN (accidentally)?
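A quick way to check is to print what PyTorch sees, e.g.:

import torch

print(torch.backends.cudnn.enabled)         # False if it was switched off somewhere in the code
print(torch.backends.cudnn.is_available())  # False if PyTorch can't find a usable cuDNN
print(torch.backends.cudnn.version())       # None if cuDNN isn't available at all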

jbohnslav commented 5 years ago

Hello all, thanks so much for your help on this! I was not setting torch.backends.cudnn.enabled = False. However, it seems there was a problem with my cuDNN installation: while trying to verify it, I couldn't find cudnn.h. According to this comment on another issue, the cause was that I had installed cuDNN from a .deb file.

I reinstalled cuDNN from a tar file, then uninstalled and reinstalled PyTorch and apex. This solved my issue.

Edit: with cuDNN and amp working properly, my model's memory footprint is a tiny fraction of what it was previously. On an RTX GPU, it also runs about twice as fast. Thanks again!

mcarilli commented 5 years ago

Great catch @ptrblck, this is a helpful gotcha to be aware of in the future.

@jbohnslav if you'll indulge us, I am curious about that 2X speedup. Do you mean that (amp O1 or O2, cudnn enabled) is 2X faster than (amp off or O0, cudnn disabled/unavailable)? If so, I am wondering how much of the 2X comes from enabling amp mixed precision and how much comes from enabling cudnn. Could you try running an epoch with (amp.initialize(..., enabled=False), cudnn enabled) so we can get the comparison with (amp O1 or O2, cudnn enabled)?
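Concretely, something along these lines should give that baseline, with everything else in the script unchanged:

# fp32 baseline with cudnn enabled: enabled=False renders all amp calls no-ops
# (including amp.scale_loss), so the rest of the training loop can stay as-is
model, optimizer = amp.initialize(model, optimizer, opt_level='O1', enabled=False)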

jbohnslav commented 5 years ago

@mcarilli,

Sure, I ran a few quick tests. I have a rather complex loss function with many components, so I'm listing both training speed and inference speed: training includes the forward pass, loss computation, backward pass, and optimizer step; inference includes only the forward pass. Note that with opt_level='O3' and keep_batchnorm_fp32=True, losses became nan.

| Experiment | Train speed (fps) | Inference speed (fps) |
|---|---|---|
| fp32, cudnn=False | 13.19 | 25.09 |
| fp32, cudnn=True | 22.81 | 50.87 |
| fp16 (opt_level 'O1'), cudnn=True | 10.30 | 63.91 |
| fp16 (opt_level 'O2'), cudnn=True | 10.14 | 64.32 |
| fp16 (opt_level 'O3', keep_batchnorm_fp32=True), cudnn=True | 23.96 | 65.02 |

I'm using a Titan RTX GPU. It seems like most of the speedup came from fixing my cuDNN install. There's roughly a 20% increase in inference speed with half precision, but roughly a 50% slowdown during training. That isn't the speedup I'd hoped for from Tensor Cores, but it's outside the scope of this issue.
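For what it's worth, the timing loop is roughly along these lines (a simplified sketch, assuming model, criterion, optimizer, images, and amp are set up as in the earlier snippets; measure_fps is just a helper for this example and reports iterations per second, so frame rate is that times the batch size):

import time
import torch

def measure_fps(step_fn, n_iters=100, warmup=10):
    # warm-up iterations so cudnn autotuning and lazy initialization don't skew the timing
    for _ in range(warmup):
        step_fn()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        step_fn()
    torch.cuda.synchronize()
    return n_iters / (time.time() - start)

def train_step():
    # forward pass, loss computation, backward pass, and optimizer step
    optimizer.zero_grad()
    outputs = model(images)
    loss = criterion(images, outputs)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()

def inference_step():
    # forward pass only
    with torch.no_grad():
        model(images)

print('train iterations/s:', measure_fps(train_step))
print('inference iterations/s:', measure_fps(inference_step))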