NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

Error if the gradient of tensor is None. #131

Open Liangtaiwan opened 5 years ago

Liangtaiwan commented 5 years ago

The gradient of a tensor may be None if the tensor is used in the forward pass but not in the backward pass.

For example, I'm using BERT to fine-tune a model with the second-to-last encoded layer. The last layer is still computed during the forward pass, but its gradient is never computed during the backward pass.

The following is the error message.

File "/usr/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/optimizers/fp16_optimizer.py", line 147, in step
    grads_groups_flat.append(_flatten_dense_tensors([p.grad for p in group]))
File "/usr/lib/python3.7/site-packages/torch/_utils.py", line 194, in _flatten_dense_tensors
    flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0)
File "/usr/lib/python3.7/site-packages/torch/_utils.py", line 194, in <listcomp>
    flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0)
AttributeError: 'NoneType' object has no attribute 'contiguous'
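
A minimal PyTorch sketch of this behaviour (the two-layer ModuleDict below is purely illustrative, not the reporter's model):

import torch

# Sketch: a parameter that is never used in the forward pass still has
# p.grad == None after backward(), which is what later trips up the optimizer.
model = torch.nn.ModuleDict({
    "used": torch.nn.Linear(1, 1),
    "unused": torch.nn.Linear(1, 1),
})
out = model["used"](torch.randn(8, 1))
out.sum().backward()
print([name for name, p in model.named_parameters() if p.grad is None])
# ['unused.weight', 'unused.bias']
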
mcarilli commented 5 years ago

Are you using FusedAdam?

matej-svejda commented 5 years ago

I'm getting the same error using FusedAdam, initialized as follows:

optimizer = FusedAdam(optimizer_grouped_parameters, lr=LEARNING_RATE, bias_correction=False, max_grad_norm=1.0)
optimizer = FP16_Optimizer(optimizer, static_loss_scale=LOSS_SCALE)

and then calling

optimizer.backward(loss)
optimizer.step()
optimizer.zero_grad()

Any ideas as to what the problem is?

Liangtaiwan commented 5 years ago

@mcarilli Yes, I am using FusedAdam. @matej-svejda I think the problem is caused by layers that run in the forward pass but are never backpropagated through, because they are not relevant to the loss.

schoennenbeck commented 5 years ago

Here is a minimal example that produces the same problem:

import torch as T
from apex.optimizers import FP16_Optimizer, FusedAdam

class MinimalExample(T.nn.Module):
    def __init__(self):
        super().__init__()
        self.used_linear = T.nn.Linear(1,1)
        self.unused_linear = T.nn.Linear(1,1)

    def forward(self, x):
        return self.used_linear(x)

model = MinimalExample().to("cuda:0").half()
optimizer = FP16_Optimizer(FusedAdam(model.parameters()), dynamic_loss_scale=True)

inputs = T.randn(8,1)
targets = T.randn(8,1)
loss_fn = T.nn.MSELoss()

outputs = model(inputs.to(device="cuda:0", dtype=T.float16))
loss = loss_fn(outputs, targets.to(device="cuda:0", dtype=T.float16))
optimizer.backward(loss)
optimizer.step()

Parameters that are not touched during the forward pass (or, more concretely, that do not contribute to the loss) have p.grad equal to None. However, FP16_Optimizer calls _flatten_dense_tensors on all its parameters during step(), which in turn calls .contiguous() on the .grad of every parameter, and this fails if one of them is None.
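
A tiny sketch of that failure mode, calling torch's private _flatten_dense_tensors helper directly purely for illustration:

import torch
from torch._utils import _flatten_dense_tensors

# One "gradient" is None, as it is for an unused parameter; flattening calls
# .contiguous() on every entry and raises the AttributeError shown above.
grads = [torch.randn(3), None]
flat = _flatten_dense_tensors(grads)  # AttributeError: 'NoneType' object has no attribute 'contiguous'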

As a quick-and-dirty workaround for the time being: find out the names of the parameters for which this happens and explicitly exclude them from the parameters passed to the optimizer.

E.g.

exclude_params = ['encoder._bert.pooler.dense.weight', 'encoder._bert.pooler.dense.bias']
optimizer = FP16_Optimizer(FusedAdam([p for (n,p) in model.named_parameters() if n not in exclude_params]), dynamic_loss_scale=True)

However, this only works if the parameters that do not contribute to the loss are always the same.
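
One hedged sketch of building such an exclusion automatically (sample_inputs, sample_targets and loss_fn are placeholders, and the same caveat applies: the set of unused parameters must not change between steps):

# Sketch: run one ordinary forward/backward pass first, then hand the optimizer
# only the parameters that actually received a gradient.
loss_fn(model(sample_inputs), sample_targets).backward()
contributing = [p for p in model.parameters() if p.grad is not None]
model.zero_grad()
optimizer = FP16_Optimizer(FusedAdam(contributing), dynamic_loss_scale=True)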

schoennenbeck commented 5 years ago

Here is an actual fix:

In apex/optimizers/fp16_optimizer.py line 147 currently reads

grads_groups_flat.append(_flatten_dense_tensors([p.grad for p in group]))

If you replace that with

grads_groups_flat.append(_flatten_dense_tensors([p.grad if p.grad is not None else p.new_zeros(p.size()) for p in group]))

you get rid of the bug. I would be happy to submit a pull request for this fix. However, I am not sure it is really the optimal solution, since we need to allocate a new all-zeros tensor for each parameter whose gradient is None even though this clearly adds nothing to the computation. I haven't found a way around that yet: all gradients and parameters get flattened, so it is essential that these zeros are there.

mcarilli commented 5 years ago

I believe @FDecaYed has used FusedAdam for fine-tuning before, with the current incarnation. How did you do that? Did you construct

optimizer = FP16_Optimizer(FusedAdam([p for p in model.parameters() if p.requires_grad]), dynamic_loss_scale=True)
schoennenbeck commented 5 years ago

I believe @FDecaYed has used FusedAdam for fine-tuning before, with the current incarnation. How did you do that? Did you construct

optimizer = FP16_Optimizer(FusedAdam([p for p in model.parameters() if p.requires_grad]), dynamic_loss_scale=True)

That won't work. In the minimal example I gave above you can easily substitute that in without changing the outcome, since the unused linear layer still (in principle) requires gradients.

Also most finetuning workflows are probably not affected since in most cases every layer contributes to the forward computation.
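
A sketch that makes this concrete, reusing the MinimalExample class and the import alias T from the earlier comment:

# requires_grad is True for every parameter, including the unused layer, so the
# suggested filter keeps them all; after backward their gradients are still None.
model = MinimalExample().to("cuda:0").half()
print(all(p.requires_grad for p in model.parameters()))        # True
outputs = model(T.randn(8, 1, device="cuda:0", dtype=T.float16))
T.nn.MSELoss()(outputs, T.zeros_like(outputs)).backward()
print([p.grad for p in model.unused_linear.parameters()])      # [None, None]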

FDecaYed commented 5 years ago

@mcarilli I think I have seen this before. It was caused by something like this: https://github.com/FDecaYed/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L740 Basically you load the whole model structure and weights but do not use some head for a given task, so the weights along that branch are never updated (there may be wasted compute as well, but that's a separate issue). I saw it while developing and did something like exclude_params. I believe modifying the model or filtering the parameters should be advised for now, to avoid wasting compute.

But the point that the set of parameters with None gradients might change is a good one. Maybe it is worth adding a check in Apex after all?

@Liangtaiwan @schoennenbeck @matej-svejda Thanks for bringing this problem to our attention!

schinger commented 5 years ago

Here is an actual fix:

In apex/optimizers/fp16_optimizer.py line 147 currently reads

grads_groups_flat.append(_flatten_dense_tensors([p.grad for p in group]))

If you replace that with

grads_groups_flat.append(_flatten_dense_tensors([p.grad if p.grad is not None else p.new_zeros(p.size()) for p in group]))

you get rid of the bug. I would be happy to submit a pull request for this fix. However, I am not sure it is really the optimal solution, since we need to allocate a new all-zeros tensor for each parameter whose gradient is None even though this clearly adds nothing to the computation. I haven't found a way around that yet: all gradients and parameters get flattened, so it is essential that these zeros are there.

If you have more than one GPU, just call this and everything will be OK:

model = torch.nn.DataParallel(model)

Of course, you may need to call loss = loss.mean() before the backward pass.

adihaviv commented 5 years ago

What if I use just one GPU? Is there a planned fix for this issue?

mcarilli commented 5 years ago

@adihaviv Yes, there is a planned fix. I'm reworking FusedAdam so that it won't require param flattening anymore (WIP branch is https://github.com/NVIDIA/apex/tree/multi_tensor_sgd) and None gradients should be acceptable in a single process.

DeepakSinghRawat commented 5 years ago

@mcarilli Thank you for looking into this issue. Any update on when the fix will be completed?

yeliu918 commented 4 years ago

Hi, I changed the line to _flatten_dense_tensors([p.grad if p.grad is not None else p.new_zeros(p.size()) for p in group]) in fp16_optimizer.py, but now I get a new error from fused_adam.py: "FusedAdam has been updated. Simply initialize it identically to torch.optim.Adam, and call step() with no arguments." It is raised here:

if any(p is not None for p in [grads, output_params, scale, grad_norms]):
    raise RuntimeError('FusedAdam has been updated. Simply initialize it identically to torch.optim.Adam, and call step() with no arguments.')
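
For reference, a sketch of the usage that error message asks for, based only on the message itself (model, loss and the learning rate are placeholders):

from apex.optimizers import FusedAdam

# Construct FusedAdam exactly like torch.optim.Adam and call step() with no
# arguments; do not pass grads/output_params/scale/grad_norms to step().
optimizer = FusedAdam(model.parameters(), lr=1e-5)
loss.backward()
optimizer.step()
optimizer.zero_grad()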