f-dangel / backpack

BackPACK - a backpropagation package built on top of PyTorch which efficiently computes quantities other than the gradient.
https://backpack.pt/
MIT License

v1.4.0 no longer seems to support `backward()` with the `inputs` parameter referencing a sub-module's parameters #233

Closed · cwognum closed this 2 years ago

cwognum commented 2 years ago

I am playing around with the DomainBed repository. I noticed that for the implementation of Fishr, they specifically install version 1.3.0 and I was wondering why.

After a bit of experimentation, it seems that it is no longer possible to call backward(inputs=...) where inputs references a sub-module's parameters. I adjusted the example from your documentation to replicate the issue:

from torch.nn import CrossEntropyLoss, Flatten, Linear, Sequential

from backpack import backpack, extend
from backpack.extensions import BatchGrad
from backpack.utils.examples import load_one_batch_mnist

X, y = load_one_batch_mnist(batch_size=512)

model = Sequential(Flatten(), Linear(784, 128), Linear(128, 10))  # I added an additional layer here
lossfunc = CrossEntropyLoss()

model = extend(model)
lossfunc = extend(lossfunc)

loss = lossfunc(model(X), y)
with backpack(BatchGrad()):
    loss.backward(inputs=list(model[-1].parameters()))  # I am trying to get the gradient with respect to the last submodule

for name, param in model[-1].named_parameters():  # I only loop over the parameters in the last submodule
    print(name)
    print(".grad.shape:             ", param.grad.shape)
    print(".grad_batch.shape:       ", param.grad_batch.shape)

With backpack-for-pytorch==1.4.0, this gives:

AttributeError: 'Parameter' object has no attribute 'grad_batch'

With backpack-for-pytorch==1.3.0, this prints the expected output:

weight
.grad.shape:              torch.Size([10, 128])
.grad_batch.shape:        torch.Size([512, 10, 128])
bias
.grad.shape:              torch.Size([10])
.grad_batch.shape:        torch.Size([512, 10])

I tried going through the git history of this repository to identify what changed between these two versions, but I have not managed to pin down the change that caused this. I was wondering whether this is intentional or a bug.

cwognum commented 2 years ago

Changing list(model[-1].parameters()) to list(model.parameters())[2:] (which is effectively the same thing) does work as expected. So the issue seems to be caused specifically by referencing a sub-module of the main module.

f-dangel commented 2 years ago

Hi Cas,

thanks for your detailed description and the code snippet. One main difference between 1.3.0 and 1.4.0 is that we replaced backward_hooks with full_backward_hooks. One explanation for the behavior you observe is that, somehow, the full_backward_hook is not triggered with list(model[-1].parameters()), but is with list(model.parameters())[2:].
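
One way to probe this hypothesis without BackPACK would be to register a plain full backward hook on each Linear layer and check whether it fires when backward() is restricted via inputs=. A rough, untested sketch, using random data in place of the MNIST batch:

import torch
from torch.nn import CrossEntropyLoss, Flatten, Linear, Sequential

model = Sequential(Flatten(), Linear(784, 128), Linear(128, 10))
lossfunc = CrossEntropyLoss()

def hook(module, grad_input, grad_output):
    # prints whenever PyTorch triggers the full backward hook for this module
    print("full backward hook fired on", module)

for module in model:
    if isinstance(module, Linear):
        module.register_full_backward_hook(hook)

X = torch.rand(8, 1, 28, 28)
y = torch.randint(0, 10, (8,))

loss = lossfunc(model(X), y)
loss.backward(inputs=list(model[-1].parameters()))  # does the hook on the last layer fire?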

Best, Felix

f-dangel commented 2 years ago

As BackPACK was originally designed to work with loss.backward() without any arguments, you can try circumventing your issue by setting requires_grad=False for all parameters except those you are interested in, then running loss.backward() without specifying inputs=....
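
In terms of the snippet above, such a workaround could look roughly like this (an untested sketch that reuses model, lossfunc, X, and y from your example):

# Freeze everything, then re-enable gradients only for the last layer's parameters.
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():
    p.requires_grad = True

loss = lossfunc(model(X), y)
with backpack(BatchGrad()):
    loss.backward()  # no inputs=..., the setting BackPACK was designed for

for name, param in model[-1].named_parameters():
    print(name, param.grad_batch.shape)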

cwognum commented 2 years ago

Hi Felix,

Thank you for the quick response and the proposed workaround.

First of all: I actually think I made an error somewhere when I tried changing list(model[-1].parameters()) to list(model.parameters())[2:]. I can no longer reproduce this discrepancy: with version 1.4.0, both give the same AttributeError for me. With regard to your questions:

I am using torch==1.10.0

With backpack-for-pytorch==1.3.0, the debug information is:

[DEBUG] Extending Sequential(
  (0): Flatten(start_dim=1, end_dim=-1)
  (1): Linear(in_features=784, out_features=128, bias=True)
  (2): Linear(in_features=128, out_features=10, bias=True)
)
[DEBUG] Extending Flatten(start_dim=1, end_dim=-1)
[DEBUG] Extending Linear(in_features=784, out_features=128, bias=True)
[DEBUG] Extending Linear(in_features=128, out_features=10, bias=True)
[DEBUG] Extending CrossEntropyLoss()
[DEBUG] Running extension <backpack.extensions.firstorder.batch_grad.BatchGrad object at 0x7ffa5cb79070> on CrossEntropyLoss()
[DEBUG] Running extension <backpack.extensions.firstorder.batch_grad.BatchGrad object at 0x7ffa5cb79070> on Linear(in_features=128, out_features=10, bias=True)
[DEBUG] Running extension <backpack.extensions.firstorder.batch_grad.BatchGrad object at 0x7ffa5cb79070> on Sequential(
  (0): Flatten(start_dim=1, end_dim=-1)
  (1): Linear(in_features=784, out_features=128, bias=True)
  (2): Linear(in_features=128, out_features=10, bias=True)
)

With backpack-for-pytorch==1.4.0, the debug information is:

[DEBUG] Extending Sequential(
  (0): Flatten(start_dim=1, end_dim=-1)
  (1): Linear(in_features=784, out_features=128, bias=True)
  (2): Linear(in_features=128, out_features=10, bias=True)
)
[DEBUG] Extending Flatten(start_dim=1, end_dim=-1)
[DEBUG] Extending Linear(in_features=784, out_features=128, bias=True)
[DEBUG] Extending Linear(in_features=128, out_features=10, bias=True)
[DEBUG] Extending CrossEntropyLoss()
[DEBUG] Running extension <backpack.extensions.firstorder.batch_grad.BatchGrad object at 0x7fb880926dc0> on CrossEntropyLoss()
[DEBUG] Running extension hook on CrossEntropyLoss()

The hooks do not seem to be called at all for the Linear layers in this case. If I change list(model[-1].parameters()) to simply list(model.parameters()), it gives:

[DEBUG] Extending Sequential(
  (0): Flatten(start_dim=1, end_dim=-1)
  (1): Linear(in_features=784, out_features=128, bias=True)
  (2): Linear(in_features=128, out_features=10, bias=True)
)
[DEBUG] Extending Flatten(start_dim=1, end_dim=-1)
[DEBUG] Extending Linear(in_features=784, out_features=128, bias=True)
[DEBUG] Extending Linear(in_features=128, out_features=10, bias=True)
[DEBUG] Extending CrossEntropyLoss()
[DEBUG] Running extension <backpack.extensions.firstorder.batch_grad.BatchGrad object at 0x7f4cb80b0df0> on CrossEntropyLoss()
[DEBUG] Running extension hook on CrossEntropyLoss()
[DEBUG] Running extension <backpack.extensions.firstorder.batch_grad.BatchGrad object at 0x7f4cb80b0df0> on Linear(in_features=128, out_features=10, bias=True)
[DEBUG] Running extension hook on Linear(in_features=128, out_features=10, bias=True)
[DEBUG] Running extension <backpack.extensions.firstorder.batch_grad.BatchGrad object at 0x7f4cb80b0df0> on Linear(in_features=784, out_features=128, bias=True)
[DEBUG] Running extension hook on Linear(in_features=784, out_features=128, bias=True)

You mentioned that BackPACK was originally designed to work with loss.backward() without any arguments. Could it be that when extending the model, the first level of recursion gets "special treatment"? Is there any change between the two versions that would explain this behavior? And would you argue that this is expected behavior?

f-dangel commented 2 years ago

Hi, thanks for your clarifications.

Could it be that when extending the model, the first level of recursion gets "special treatment"?

There's no special treatment of the first hierarchy level when extending a model. extend is called recursively on the submodules, as indicated by the DEBUG messages you posted.

Any change between the two versions that would explain this behavior?

From the DEBUG messages, I still believe the different behavior results from full_backward_hook (1.4.0) versus backward_hook (1.3.0). I don't know how to further boil down the cause, but maybe backward works differently when inputs=... is specified.

I would recommend trying the above workaround. Let me know if it works.

ngonthier commented 2 years ago

The above workaround doesn't seem to work.

f-dangel commented 2 years ago

Hi @ngonthier,

can you describe in more detail how/why the workaround does not seem to work?

ngonthier commented 2 years ago

Hi, even if I set requires_grad=False for all parameters except the one I am interested in (call it Var1) and then run loss.backward(), the gradient is still computed for all the parameters of the model, not only for Var1. I am using version 1.4.0.

f-dangel commented 2 years ago

Hi,

that indeed sounds like unintended behavior. Could you provide a minimal working example that reproduces this issue?

cwognum commented 2 years ago

Hi @ngonthier and @f-dangel,

Sorry for not replying any sooner. I believe @ngonthier's observation is correct. See the minimal working example below:

from torch.nn import CrossEntropyLoss, Flatten, Linear, Sequential

from backpack import backpack, extend
from backpack.extensions import BatchGrad
from backpack.utils.examples import load_one_batch_mnist

X, y = load_one_batch_mnist(batch_size=512)

l1 = Linear(784, 128)
l1.requires_grad = False
l2 = Linear(128, 10)

model = Sequential(Flatten(), l1, l2)
lossfunc = CrossEntropyLoss()

model = extend(model, debug=True)
lossfunc = extend(lossfunc, debug=True)

loss = lossfunc(model(X), y)
with backpack(BatchGrad(), debug=True):
    loss.backward()

# This should fail for the first layer, right? It doesn't!
for name, param in model.named_parameters():
    print(name)
    print(".grad.shape:             ", param.grad.shape)
    print(".grad_batch.shape:       ", param.grad_batch.shape)

This is the DEBUG output:

[DEBUG] Extending Sequential(
  (0): Flatten(start_dim=1, end_dim=-1)
  (1): Linear(in_features=784, out_features=128, bias=True)
  (2): Linear(in_features=128, out_features=10, bias=True)
)
[DEBUG] Extending Flatten(start_dim=1, end_dim=-1)
[DEBUG] Extending Linear(in_features=784, out_features=128, bias=True)
[DEBUG] Extending Linear(in_features=128, out_features=10, bias=True)
[DEBUG] Extending CrossEntropyLoss()
[DEBUG] Running extension <backpack.extensions.firstorder.batch_grad.BatchGrad object at 0x7fe05e8db0d0> on CrossEntropyLoss()
[DEBUG] Running extension hook on CrossEntropyLoss()
[DEBUG] Running extension <backpack.extensions.firstorder.batch_grad.BatchGrad object at 0x7fe05e8db0d0> on Linear(in_features=128, out_features=10, bias=True)
[DEBUG] Running extension hook on Linear(in_features=128, out_features=10, bias=True)
[DEBUG] Running extension <backpack.extensions.firstorder.batch_grad.BatchGrad object at 0x7fe05e8db0d0> on Linear(in_features=784, out_features=128, bias=True)
[DEBUG] Running extension hook on Linear(in_features=784, out_features=128, bias=True)
1.weight
.grad.shape:              torch.Size([128, 784])
.grad_batch.shape:        torch.Size([512, 128, 784])
1.bias
.grad.shape:              torch.Size([128])
.grad_batch.shape:        torch.Size([512, 128])
2.weight
.grad.shape:              torch.Size([10, 128])
.grad_batch.shape:        torch.Size([512, 10, 128])
2.bias
.grad.shape:              torch.Size([10])
.grad_batch.shape:        torch.Size([512, 10])

f-dangel commented 2 years ago

Hi,

thanks for providing a script to reproduce the issue.

I think you're incorrectly setting requires_grad: It's an attribute of the module's parameters, not the module itself (correct me if I'm wrong).

The correct way to disable gradients is

for p in l1.parameters():
    p.requires_grad = False

instead of

l1.requires_grad = False
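
As a side note, nn.Module also provides an in-place requires_grad_ method that recursively sets requires_grad on all of a module's parameters, so the following should be equivalent to the loop above:

l1.requires_grad_(False)  # note the trailing underscore: this recurses over l1's parameters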

cwognum commented 2 years ago

You're right! I was under the impression that this would recursively disable grad for all parameters... :eyes:

With the suggested change it does work. I also checked whether, once seeded, the two methods give the same output in version 1.3.0, and that is indeed the case:

[DEBUG] Extending Sequential(
  (0): Flatten(start_dim=1, end_dim=-1)
  (1): Linear(in_features=784, out_features=128, bias=True)
  (2): Linear(in_features=128, out_features=10, bias=True)
)
[DEBUG] Extending Flatten(start_dim=1, end_dim=-1)
[DEBUG] Extending Linear(in_features=784, out_features=128, bias=True)
[DEBUG] Extending Linear(in_features=128, out_features=10, bias=True)
[DEBUG] Extending CrossEntropyLoss()
[DEBUG] Running extension <backpack.extensions.firstorder.batch_grad.BatchGrad object at 0x7f070e920f10> on CrossEntropyLoss()
[DEBUG] Running extension hook on CrossEntropyLoss()
[DEBUG] Running extension <backpack.extensions.firstorder.batch_grad.BatchGrad object at 0x7f070e920f10> on Linear(in_features=128, out_features=10, bias=True)
[DEBUG] Running extension hook on Linear(in_features=128, out_features=10, bias=True)
weight
.grad.shape:              torch.Size([10, 128])
.grad_batch.shape:        torch.Size([512, 10, 128])
bias
.grad.shape:              torch.Size([10])
.grad_batch.shape:        torch.Size([512, 10])

I think that leaves me with one last question before closing the issue: should there be a more informative error / warning on BackPACK's side when using the inputs argument?

f-dangel commented 2 years ago

Should there be a more informative error / warning on BackPACK's side when using the inputs argument?

I'm not sure how one would detect that backward was called with the inputs argument from within BackPACK. Do you have an idea how to do that?

cwognum commented 2 years ago

No, I'm not sure. I'm not familiar enough with the BackPACK codebase, I'm afraid... I'll close this issue then. Thank you for thinking along these last couple of weeks. :slightly_smiling_face: :+1: