f-dangel / backpack

BackPACK - a backpropagation package built on top of PyTorch which efficiently computes quantities other than the gradient.
https://backpack.pt/
MIT License

Does Backpack Support Reusing Layers (First Order Extensions) #248

Closed bchen0 closed 2 years ago

bchen0 commented 2 years ago

Hi,

Does backpack allow for the reuse of layers for first-order extensions, as in, say, a Siamese network? I only need this for first-order extensions, in particular batch grads. An example is given below. It produces an "AttributeError: 'Linear' object has no attribute 'input0'" error.

Thanks!

import torch.nn as nn
import torch
from backpack import backpack, extend
from backpack.extensions import BatchGrad

class TestModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(5, 5)

    def forward(self, x):
        return self.net(x[:, :5]) + self.net(x[:, 5:])

test_module = TestModule()
extend(test_module)

rand_vec = torch.randn(5, 10)
loss = test_module(rand_vec).sum()

with backpack(BatchGrad()):
    loss.backward()

fKunstner commented 2 years ago

Hi!

Interesting use-case. This won't work out of the box: backpack assumes each layer is used only once, sequentially.

There might be a workaround with a bit of post-processing, without diving into the internals. Something along the lines of defining the network using two (different) modules, keeping track of which layers should have the same weights, and syncing them whenever they change.

Can you go into a bit more detail about what you are looking for? I'm not sure I follow what quantity you expect. (5 gradients which contain the sum of the gradients of each layer? 2 pairs of 5 gradients, as if the two linear modules had the same weights but were distinct? Something else? Are you training the network at the same time?)

bchen0 commented 2 years ago

Hi,

Apologies for the late response. For more details/context - what I'm trying to do is create a symmetric function by letting f be any arbitrary network and considering the sum f(x) + f(-x). If that were it, I think it would suffice to calculate f(x) and f(-x) in separate forward passes and sum the gradients. However, the issue is that I want to calculate g(f(x) + f(-x)) and calculate the loss on the output of that.

The example I gave was not quite what I wanted - it would be something more similar to the code block below. Do you have any ideas how this could be done? Unfortunately I think that setting up two networks and then averaging the gradients, or averaging the weights after updates would produce updates that aren't quite right.

class TestModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(5, 5)

    def forward(self, x):
        return self.net(x) + self.net(-x)

Thanks

fKunstner commented 2 years ago

That should still work. The gradients with respect to the parameters of f sum up. GitHub's not the best for math, but:

𝜕/𝜕w g(f(x,w) + f(−x,w)) 
= g'(f(x,w) + f(−x,w)) ⋅ 𝜕/𝜕w [f(x,w) + f(−x,w)]
= g'(f(x,w) + f(−x,w)) ⋅ [ 𝜕/𝜕w f(x,w) + 𝜕/𝜕w f(−x,w)]
= g'(f(x,w) + f(−x,w)) ⋅ 𝜕/𝜕w f(x,w) + g'(f(x,w) + f(−x,w)) ⋅ 𝜕/𝜕w f(−x,w)

This gives you the same result as if f(x,w) and f(-x,v) had different parameters w and v, and you took the gradients with respect to w and v individually and summed them (evaluated at v = w). This won't work for 2nd-order extensions, but individual gradients will.
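
As a quick sanity check of the identity above, here is a minimal sketch in plain Python (no PyTorch or BackPACK). The scalar functions `f` and `g` are hypothetical stand-ins chosen for illustration, not anything from the issue, and the derivatives are approximated by central finite differences.

```python
import math


def f(x, w):
    # Hypothetical scalar "network" with a single parameter w.
    return math.exp(w * x)


def g(u):
    # Hypothetical outer function applied to f(x) + f(-x).
    return u ** 2


def shared(x, w):
    # Both branches use the same parameter w.
    return g(f(x, w) + f(-x, w))


def separate(x, w, v):
    # The negative branch gets its own parameter v.
    return g(f(x, w) + f(-x, v))


def central_diff(fn, t, eps=1e-6):
    # Central finite-difference approximation of d fn / d t.
    return (fn(t + eps) - fn(t - eps)) / (2 * eps)


x, w = 0.7, 0.3

# Derivative with respect to the shared parameter.
grad_shared = central_diff(lambda t: shared(x, t), w)

# Derivatives with respect to w and v separately, evaluated at v = w.
grad_w = central_diff(lambda t: separate(x, t, w), w)
grad_v = central_diff(lambda t: separate(x, w, t), w)
grad_sum = grad_w + grad_v

print(grad_shared, grad_sum)
print(math.isclose(grad_shared, grad_sum, rel_tol=1e-6))
```

The two printed numbers should agree up to finite-difference error, which is the scalar version of "the gradients sum up".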

Here's an example that defines some symmetric network and does 3 steps of GD (as I guess you would normally) and a workaround that defines two sub-networks and keeps them in sync manually. It also adds the individual gradients so you can do something with them: basic_training.txt, backpack_workaround.txt

To work with PyTorch's optimizer interface, we can give it the .parameters() of one sub-network (say the positive one). After a .backward() but before optim.step(), replace the .grad of the positive network's parameters by the sum of the .grad of the two networks (that's what sum_grads_into_posnet below does). Then, after each update to the positive network, call sync_weights to keep the two in sync.

The code for the network:

class SymmetricNetwork_backpackworkaround(nn.Module):
    def __init__(self):
        super().__init__()

        self.pos_net = SomeNetwork()
        self.neg_net = SomeNetwork()
        self.sync_weights()

    def forward(self, x):
        return self.pos_net(x) + self.neg_net(-x)

    def sync_weights(self):
        """Copies the weights of the positive net into the negative one"""
        for p1, p2 in zip(self.pos_net.parameters(), self.neg_net.parameters()):
            p2.data = p1.data

    def sum_grads_into_posnet(self):
        """Sums the gradients of both networks into the gradients of pos_net"""
        for p1, p2 in zip(self.pos_net.parameters(), self.neg_net.parameters()):
            # Maybe give them another name to not confuse them?
            # Parameters always have a .grad attribute (None before backward),
            # so check for None instead of using hasattr.
            if p1.grad is not None and p2.grad is not None:
                p1.grad = p1.grad + p2.grad
            # .grad_batch is only set by a backward pass under BatchGrad.
            if hasattr(p1, "grad_batch") and hasattr(p2, "grad_batch"):
                p1.grad_batch = p1.grad_batch + p2.grad_batch

The training loop looks like:

    network = SymmetricNetwork_backpackworkaround()
    extend(network)

    # Give the optimizer only the weights of the positive network.
    # We'll manually sum the gradients of both networks into the positive one.
    optim = torch.optim.SGD(params=network.pos_net.parameters(), lr=0.01)

    max_iter = 3
    for t in range(max_iter):

        network.zero_grad()
        loss = lossfunc(network(X), y)

        with backpack(BatchGrad()):
            loss.backward()
            network.sum_grads_into_posnet()

        print("")
        print("do something with the individual (summed) gradients...")
        print("")

        # Update the weights of the positive network.
        # Optim uses the .grad attribute, and we summed the gradients of the
        # positive and negative network, so this will take a step using the
        # full gradient.
        optim.step()
        network.sync_weights()

After the 3 steps of GD, both setups end up with the same parameters:

========================================
Parameters after one step - Normal setup
========================================
Parameter containing:
tensor([[-0.0036,  0.2279, -0.3423, -0.3135, -0.1547],
        [ 0.1533, -0.0048,  0.4697, -0.0236,  0.1770],
        [-0.1161, -0.0710, -0.3663, -0.2654, -0.1486],
        [ 0.0192,  0.1629,  0.2359, -0.2674, -0.1736],
        [ 0.1391,  0.3268, -0.0653,  0.2751, -0.0603]], requires_grad=True)
Parameter containing:
tensor([ 0.0108,  0.6527, -0.4625, -0.3195, -0.1850], requires_grad=True)
Parameter containing:
tensor([[ 0.0228,  0.7392, -0.2010, -0.1286, -0.1976]], requires_grad=True)
Parameter containing:
tensor([0.2174], requires_grad=True)
===============================================
Parameters after one step - Backpack workaround
===============================================
Parameter containing:
tensor([[-0.0036,  0.2279, -0.3423, -0.3135, -0.1547],
        [ 0.1533, -0.0048,  0.4697, -0.0236,  0.1770],
        [-0.1161, -0.0710, -0.3663, -0.2654, -0.1486],
        [ 0.0192,  0.1629,  0.2359, -0.2674, -0.1736],
        [ 0.1391,  0.3268, -0.0653,  0.2751, -0.0603]], requires_grad=True)
Parameter containing:
tensor([ 0.0108,  0.6527, -0.4625, -0.3195, -0.1850], requires_grad=True)
Parameter containing:
tensor([[ 0.0228,  0.7392, -0.2010, -0.1286, -0.1976]], requires_grad=True)
Parameter containing:
tensor([0.2174], requires_grad=True)

===============================================
Same, but for the negative network:
===============================================
Parameter containing:
tensor([[-0.0036,  0.2279, -0.3423, -0.3135, -0.1547],
        [ 0.1533, -0.0048,  0.4697, -0.0236,  0.1770],
        [-0.1161, -0.0710, -0.3663, -0.2654, -0.1486],
        [ 0.0192,  0.1629,  0.2359, -0.2674, -0.1736],
        [ 0.1391,  0.3268, -0.0653,  0.2751, -0.0603]], requires_grad=True)
Parameter containing:
tensor([ 0.0108,  0.6527, -0.4625, -0.3195, -0.1850], requires_grad=True)
Parameter containing:
tensor([[ 0.0228,  0.7392, -0.2010, -0.1286, -0.1976]], requires_grad=True)
Parameter containing:
tensor([0.2174], requires_grad=True)
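
The sum-then-sync recipe can also be checked without PyTorch. The sketch below is a framework-free toy, assuming a hypothetical one-parameter model `f(x, w) = (w*x + 1)**2` with a squared-error loss summed over samples and hand-written derivatives. The per-sample lists play the role of `grad_batch`; none of these names come from BackPACK.

```python
import math


def f(x, w):
    # Hypothetical one-parameter "network".
    return (w * x + 1.0) ** 2


def df_dw(x, w):
    # Analytic derivative of f with respect to w.
    return 2.0 * (w * x + 1.0) * x


def per_sample_grads_shared(xs, ys, w):
    # Per-sample gradients of (f(x,w) + f(-x,w) - y)^2 with one shared w.
    grads = []
    for x, y in zip(xs, ys):
        out = f(x, w) + f(-x, w)
        dloss_dout = 2.0 * (out - y)
        grads.append(dloss_dout * (df_dw(x, w) + df_dw(-x, w)))
    return grads


def per_sample_grads_two_copies(xs, ys, w_pos, w_neg):
    # Same loss, but each branch has its own parameter.
    pos, neg = [], []
    for x, y in zip(xs, ys):
        out = f(x, w_pos) + f(-x, w_neg)
        dloss_dout = 2.0 * (out - y)
        pos.append(dloss_dout * df_dw(x, w_pos))
        neg.append(dloss_dout * df_dw(-x, w_neg))
    return pos, neg


def train_shared(xs, ys, w, lr, steps):
    for _ in range(steps):
        w -= lr * sum(per_sample_grads_shared(xs, ys, w))
    return w


def train_two_copies(xs, ys, w, lr, steps):
    w_pos = w_neg = w
    for _ in range(steps):
        pos, neg = per_sample_grads_two_copies(xs, ys, w_pos, w_neg)
        # Analogue of summing the per-sample gradients of both copies.
        summed = [p + n for p, n in zip(pos, neg)]
        # Step on the positive copy only, using the full gradient.
        w_pos -= lr * sum(summed)
        # Analogue of sync_weights: copy the update into the negative copy.
        w_neg = w_pos
    return w_pos, w_neg


xs = [0.5, -1.0, 2.0]
ys = [1.0, 0.0, 3.0]

w_shared = train_shared(xs, ys, w=0.1, lr=0.01, steps=3)
w_pos, w_neg = train_two_copies(xs, ys, w=0.1, lr=0.01, steps=3)

print(w_shared, w_pos, w_neg)
print(math.isclose(w_shared, w_pos, rel_tol=1e-9))
```

Skipping the sync step would let the two copies drift apart after the first update, and the summed gradient would then no longer match the shared-weight one.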

bchen0 commented 2 years ago

Oh, of course, thanks! I appreciate the incredibly thorough response - this is exactly what I was looking for