AllanYangZhou opened this issue 5 years ago
Thanks for the detailed issue. I am on leave and returning early December. I will try to look into the issue as soon as I can get to it (I'll have a bit of a backlog but will try to see what I can do in the first few weeks of the month).
I get the same error if I use weight_norm and run the model on the GPU. If I use the CPU instead, I get the following error:
Traceback (most recent call last):
File "main.py", line 214, in <module>
higher_train(opt, dataloader, generator, classifier, optimizer_a, optimizer_b)
File "main.py", line 185, in higher_train
real_loss.backward()
File "/home/user/anaconda3/envs/torch1.3/lib/python3.7/site-packages/torch/tensor.py", line 166, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/user/anaconda3/envs/torch1.3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
Hello. I've returned from leave and allocated some time to look into this issue over the next two weeks. Hopefully we'll make some progress and report back, or come back to you with questions.
In case it helps:
Looking at the implementation of weight norm in PyTorch (here), the error could come from the fact that the module's weight attribute is re-assigned before every forward pass by a forward pre-hook:
setattr(module, self.name, self.compute_weight(module))
However, because of the reparametrization of the weights, the weight attribute is not a parameter of the module; it is replaced by module.weight_g and module.weight_v, so I'm not sure how higher is dealing with that. Maybe the issue is with the function compute_weight that generates w from g and v, and higher isn't patching that.
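To make the reparametrization concrete, here is a minimal sketch (my own, not from the issue) that inspects a weight-normalized layer; it assumes stock torch.nn.utils.weight_norm applied to a plain nn.Linear:
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

layer = weight_norm(nn.Linear(4, 3))

# The trainable parameters are now weight_g (magnitude) and weight_v (direction);
# a plain `weight` tensor still exists but is no longer an nn.Parameter.
print([name for name, _ in layer.named_parameters()])  # ['bias', 'weight_g', 'weight_v']
print(isinstance(layer.weight, nn.Parameter))          # False

# On every forward pass, a forward pre-hook recomputes layer.weight from
# weight_g and weight_v (via WeightNorm.compute_weight) and re-assigns it
# with setattr -- the step that higher does not intercept.
out = layer(torch.randn(2, 4))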
This does help. I've read through the weight_norm code from PyTorch, and you are correct that this is something which higher isn't patching. We could write a hacky fix specifically for weight_norm, I think, but I would prefer a more general solution that caters to similar use cases. I will need to think through this properly and probably talk to some people from the PyTorch team. I will attempt to look into this in the next two weeks, but it's going to require some effort.
As you might guess, COVID hit and this did not get looked into. I'll chase this when I have time but unfortunately time is a scarce resource :(
I think this is also an issue for spectral_norm, which uses the forward hooks as well (probably the exact same method, but I haven't checked)... so a general solution would be awesome, because there are probably other functions which do the same thing.
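For reference, a similar sketch (mine, not from the thread) for spectral_norm shows the same pattern: the real parameter is weight_orig, the power-iteration vectors are buffers, and weight is rebuilt by a pre-forward hook on every forward pass:
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

layer = spectral_norm(nn.Linear(4, 3))

print([name for name, _ in layer.named_parameters()])  # ['bias', 'weight_orig']
print([name for name, _ in layer.named_buffers()])     # ['weight_u', 'weight_v']
print(isinstance(layer.weight, nn.Parameter))          # False

out = layer(torch.randn(2, 4))  # the pre-hook recomputes layer.weight here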
If anyone is looking for a hack until this is fixed, I found that after doing a backward() I have to put an input through the model once before entering the higher loop, and then it works.
_ = model(torch.rand_like(x))  # this goes after backward() and before higher
with higher.innerloop_ctx(...) as (fmodel, fopt):
    for (...) in inner loop:
        # ...do inner loop
        loss += fmodel(x)
loss.backward()
opt.step()
My hack above eventually broke due to something unrelated, so I took another look at this. I think this is a bug in the norm layers in PyTorch... they seem to create a dummy weight that the first forward pass overwrites, and that dummy weight doesn't get put on the right device with the rest of the model.
https://github.com/pytorch/pytorch/blob/master/torch/nn/utils/spectral_norm.py#L143
The above line is what I am talking about... but IDK where or when it eventually gets put on the GPU in the normal flow. Wherever that is, it seems to cause a mismatch with higher...
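A small check (my own sketch, not verified against the exact repro in this thread) that illustrates the mismatch: after .cuda(), the parameters and buffers move, but the stale weight attribute stays on the CPU until the first forward pass rebuilds it.
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

if torch.cuda.is_available():
    layer = spectral_norm(nn.Linear(4, 3)).cuda()

    print(layer.weight_orig.device)  # cuda:0 -- a real Parameter, moved by .cuda()
    print(layer.weight.device)       # cpu    -- plain attribute, left behind

    _ = layer(torch.randn(2, 4, device="cuda"))  # pre-hook recomputes weight on the GPU
    print(layer.weight.device)       # cuda:0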
Wondering if this has been fixed? @jeffwillette I tried your hack, but got this error instead:
RuntimeError: Function AddBackward0 returned an invalid gradient at index 1 - expected type TensorOptions(dtype=float, device=cpu, layout=Strided, requires_grad=false) but got TensorOptions(dtype=float, device=cuda:0, layout=Strided, requires_grad=false) (validate_outputs at /opt/conda/conda-bld/pytorch_1591914880026/work/torch/csrc/autograd/engine.cpp:484)
Do you know what might be the issue?
Wondering if this has been fixed? Getting the following error:
RuntimeError: Function AddmmBackward returned an invalid gradient at index 0 - expected type TensorOptions(dtype=float, device=cpu, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)) but got TensorOptions(dtype=float, device=cuda:0, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))
Hi, thanks for your work on this library!
Using a weight-normalized network in higher's inner loop raises the following error:
I can reproduce this by simply modifying the maml-omniglot example to apply weight_norm to the final linear layer (pasted below). The error only appears in the higher inner loop; I can evaluate the network on input data outside the inner loop with no error. I am on Ubuntu 16.04, Python 3.7.0, PyTorch 1.3.0, and CUDA 10.0.
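The pasted modification isn't reproduced here, but a hypothetical minimal sketch of the setup described (a small classifier whose final linear layer is wrapped in weight_norm, evaluated inside higher.innerloop_ctx) would look roughly like this; the layer sizes and loss are placeholders:
import torch
import torch.nn as nn
import torch.nn.functional as F
import higher

net = nn.Sequential(
    nn.Linear(28 * 28, 64),
    nn.ReLU(),
    nn.utils.weight_norm(nn.Linear(64, 5)),  # weight-normalized final layer
).cuda()
opt = torch.optim.SGD(net.parameters(), lr=0.1)

x = torch.randn(4, 28 * 28, device="cuda")
y = torch.randint(0, 5, (4,), device="cuda")

_ = net(x)  # evaluating the network outside the inner loop works fine

with higher.innerloop_ctx(net, opt) as (fnet, diffopt):
    loss = F.cross_entropy(fnet(x), y)  # the reported error surfaces inside the inner loop
    diffopt.step(loss)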