Open Liangtaiwan opened 5 years ago
Are you using FusedAdam?
I'm getting the same error using FusedAdam, initialized as such:
optimizer = FusedAdam(optimizer_grouped_parameters, lr=LEARNING_RATE, bias_correction=False, max_grad_norm=1.0)
optimizer = FP16_Optimizer(optimizer, static_loss_scale=LOSS_SCALE)
and then calling
optimizer.backward(loss)
optimizer.step()
optimizer.zero_grad()
Any ideas as to what the problem is?
@mcarilli Yes, I am using FusedAdam. @matej-svejda I think the problem is caused by a layer that runs in the forward pass but gets no gradient in the backward pass, because it does not contribute to the loss.
Here is a minimal example that produces the same problem:
import torch as T
from apex.optimizers import FP16_Optimizer, FusedAdam

class MinimalExample(T.nn.Module):
    def __init__(self):
        super().__init__()
        self.used_linear = T.nn.Linear(1,1)
        self.unused_linear = T.nn.Linear(1,1)

    def forward(self, x):
        return self.used_linear(x)

model = MinimalExample().to("cuda:0").half()
optimizer = FP16_Optimizer(FusedAdam(model.parameters()), dynamic_loss_scale=True)
inputs = T.randn(8,1)
targets = T.randn(8,1)
loss_fn = T.nn.MSELoss()
outputs = model(inputs.to(device="cuda:0", dtype=T.float16))
loss = loss_fn(outputs, targets.to(device="cuda:0", dtype=T.float16))
optimizer.backward(loss)
optimizer.step()
Parameters that are not touched during forward (or, more precisely, that do not contribute to the loss) have p.grad equal to None. However, FP16_Optimizer calls _flatten_dense_tensors on all its parameters during its step(), which in turn calls .contiguous on the .grad of every parameter, and that fails if one of them is None.
As a quick and dirty workaround for the time being: find out the names of the parameters for which this happens and explicitly exclude them from the parameters of the optimizer.
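The None gradients described above can be reproduced without apex or a GPU. The following is a minimal sketch in plain PyTorch on CPU; the squared-mean loss is just a stand-in for the MSELoss in the original example.

```python
import torch

class MinimalExample(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used_linear = torch.nn.Linear(1, 1)
        self.unused_linear = torch.nn.Linear(1, 1)  # never called in forward

    def forward(self, x):
        return self.used_linear(x)

model = MinimalExample()
loss = model(torch.randn(8, 1)).pow(2).mean()  # stand-in loss
loss.backward()

# Parameters that did not contribute to the loss keep .grad == None.
none_grad_names = [n for n, p in model.named_parameters() if p.grad is None]
print(none_grad_names)
```

Only the parameters of unused_linear show up in the list, which is exactly the set that trips up the flattening step.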
E.g.
exclude_params = ['encoder._bert.pooler.dense.weight', 'encoder._bert.pooler.dense.bias']
optimizer = FP16_Optimizer(FusedAdam([p for (n,p) in model.named_parameters() if n not in exclude_params]), dynamic_loss_scale=True)
However, this only works if the parameters that do not contribute to the loss are always the same.
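If the set of unused parameters is fixed, the exclude list does not have to be written by hand. Here is a rough sketch that discovers it with one dry-run backward pass. The helper name names_without_grad is hypothetical, and torch.optim.Adam is used only as a stand-in so the snippet runs without apex; with apex you would pass the same filtered list to FusedAdam.

```python
import torch

class MinimalExample(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used_linear = torch.nn.Linear(1, 1)
        self.unused_linear = torch.nn.Linear(1, 1)

    def forward(self, x):
        return self.used_linear(x)

def names_without_grad(model, loss):
    """Hypothetical helper: backpropagate once, return names whose .grad is None."""
    loss.backward()
    return [n for n, p in model.named_parameters() if p.grad is None]

model = MinimalExample()
loss = model(torch.randn(8, 1)).pow(2).mean()  # stand-in loss
exclude_params = names_without_grad(model, loss)

# Hand only the parameters that actually receive gradients to the optimizer.
trainable = [p for n, p in model.named_parameters() if n not in exclude_params]
optimizer = torch.optim.Adam(trainable)  # stand-in for FusedAdam (assumption)
optimizer.step()
```

The same caveat applies as for the hand-written list: the result is only valid as long as the same parameters stay unused in every iteration.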
Here is an actual fix:
In apex/optimizers/fp16_optimizer.py line 147 currently reads
grads_groups_flat.append(_flatten_dense_tensors([p.grad for p in group]))
If you replace that with
grads_groups_flat.append(_flatten_dense_tensors([p.grad if p.grad is not None else p.new_zeros(p.size()) for p in group]))
you get rid of the bug. I would be happy to submit a pull request for this fix. However, I am not sure this is really the optimal solution, since we need to allocate a new all-zeros tensor for each parameter whose gradient is None, even though this clearly adds nothing to the computation. I haven't found a way around that yet: all gradients and parameters get flattened, so it is essential that the zeros are there to keep the flattened gradients aligned with the flattened parameters.
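To illustrate what the replacement line does, here is a small sketch on CPU. It assumes the _flatten_dense_tensors referred to above is the private helper from torch._utils; the model and loss are the same stand-ins as in the minimal example, without apex.

```python
import torch
from torch._utils import _flatten_dense_tensors  # private PyTorch helper

class MinimalExample(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used_linear = torch.nn.Linear(1, 1)
        self.unused_linear = torch.nn.Linear(1, 1)

    def forward(self, x):
        return self.used_linear(x)

model = MinimalExample()
model(torch.randn(8, 1)).pow(2).mean().backward()

group = list(model.parameters())
# Substitute a zero tensor of matching shape wherever the gradient is None,
# so the flat gradient buffer has one slot per parameter element.
grads = [p.grad if p.grad is not None else p.new_zeros(p.size()) for p in group]
flat = _flatten_dense_tensors(grads)
print(flat.shape)
```

The flat buffer has one entry per parameter element, and the entries belonging to unused_linear are zero, which is what keeps the offsets consistent with the flattened parameters.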
I believe @FDecaYed has used FusedAdam for fine-tuning before, with the current incarnation. How did you do that? Did you construct
optimizer = FP16_Optimizer(FusedAdam([p for p in model.parameters() if p.requires_grad]), dynamic_loss_scale=True)
That won't work. In the minimal example I gave above you can easily substitute that in without changing the outcome, since the unused linear layer still (in principle) requires gradients.
Also most finetuning workflows are probably not affected since in most cases every layer contributes to the forward computation.
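A quick check of why filtering on requires_grad does not help, again as a plain-PyTorch sketch on CPU with a stand-in loss: the unused layer still has requires_grad set to True, so it passes the filter even though its gradient stays None.

```python
import torch

class MinimalExample(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used_linear = torch.nn.Linear(1, 1)
        self.unused_linear = torch.nn.Linear(1, 1)

    def forward(self, x):
        return self.used_linear(x)

model = MinimalExample()
model(torch.randn(8, 1)).pow(2).mean().backward()

# requires_grad describes whether a gradient *may* be computed,
# not whether one was actually produced by this backward pass.
filtered = [p for p in model.parameters() if p.requires_grad]
still_none = [p for p in filtered if p.grad is None]
print(len(filtered), len(still_none))
```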
@mcarilli I think I have seen this before. It was caused by something like here:
https://github.com/FDecaYed/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L740
Basically, you load the whole model structure and weights but do not use some of the heads in a given task, so the weights along those branches are not updated (there might be wasted compute as well, but that's a separate issue).
I saw it while developing and did something like exclude_params. I believe modifying the model or filtering the parameters should be the advised approach, since it also avoids the wasted compute.
But the point that the set of None gradients might change between iterations is a good one. Maybe it is worth adding a check in Apex after all?
@Liangtaiwan @schoennenbeck @matej-svejda Thanks for bringing this problem to our attention!
If you have more than one GPU, just call this and everything will be OK:
model = torch.nn.DataParallel(model)
Of course, you may need to call loss = loss.mean() before backward.
What if I use just one GPU? Is there a planned fix for this issue?
@adihaviv Yes, there is a planned fix. I'm reworking FusedAdam so that it won't require param flattening anymore (WIP branch is https://github.com/NVIDIA/apex/tree/multi_tensor_sgd) and None gradients should be acceptable in a single process.
@mcarilli Thank you for looking into this issue. Any update on when the fix will be completed?
Hi, I changed the line in fp16_optimizer.py to _flatten_dense_tensors([p.grad if p.grad is not None else p.new_zeros(p.size()) for p in group]), but now I get a new error in fused_adam.py: "FusedAdam has been updated. Simply initialize it identically to torch.optim.Adam, and call step() with no arguments." It is raised by this check:
if any(p is not None for p in [grads, output_params, scale, grad_norms]):
    raise RuntimeError('FusedAdam has been updated. Simply initialize it identically to torch.optim.Adam, and call step() with no arguments.')
The gradient of a tensor may be None if the tensor is used in the forward pass but is not reached in the backward pass.
For example, I'm using BERT to fine-tune a model on the second-to-last encoded layer. The last layer is computed in the forward pass, but its gradient is not computed in the backward pass.
The following is the error message.