DiffEqML / torchdyn

A PyTorch library entirely dedicated to neural differential equations, implicit models and related numerical methods
https://torchdyn.org
Apache License 2.0

Error in training submodules of model with adjoint method #168

Closed cversteeg closed 1 year ago

cversteeg commented 1 year ago

Describe the bug

I am trying to train a sequential auto-encoder model with a NODE as my dynamics estimator. As part of this model, I have a set of encoders that compute initial conditions for the NODE. I hold out some conditions of my training data from the NODE, using them only to train the encoder networks.

(here is a simple diagram of the relevant model features)


I am using two optimizers to do this, shown below:

def configure_optimizers(self):
        heldin_optimizer = torch.optim.Adam(
            self.parameters(),
            lr=self.hparams.learning_rate,
            weight_decay=self.hparams.weight_decay,
        )
        heldout_optimizer = torch.optim.Adam(
            list(self.encoder.parameters()) + list(self.ic_linear.parameters()),
            lr=self.hparams.learning_rate,
            weight_decay=self.hparams.weight_decay,
        )
        return heldin_optimizer, heldout_optimizer
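
With two optimizers under Lightning's automatic optimization (Lightning 1.x), training_step is invoked once per optimizer with an optimizer_idx argument; a minimal sketch of the routing this implies, where compute_heldin_loss and compute_heldout_loss are assumed placeholder helpers, not code from this post:

def training_step(self, batch, batch_idx, optimizer_idx):
    # optimizer_idx follows the order returned by configure_optimizers
    if optimizer_idx == 0:  # heldin_optimizer: all parameters
        return self.compute_heldin_loss(batch)   # assumed helper
    if optimizer_idx == 1:  # heldout_optimizer: encoder + ic_linear only
        return self.compute_heldout_loss(batch)  # assumed helper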

Now the problem: when I pass the loss from the held-out conditions to the held-out optimizer, I get the following error:

RuntimeError: One of the differentiated Tensors does not require grad

Digging around, I found that this error only appears when the model uses adjoint sensitivity to compute gradients, but not when it uses the standard autograd method (i.e., changing "sensitivity" in the constructor from "adjoint" to "autograd" lets it work):

self.decoder = NeuralODE(
    vector_field_net,
    sensitivity=self.train_method,
    solver=self.solver,
    solver_adjoint=self.solver,
)

Obviously I'd like to use the adjoint method to train my model, especially as my models grow in size.
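
For reference, the three sensitivity modes that come up in this thread select how gradients are computed through the ODE solve; a minimal self-contained sketch with a toy vector field (the constructor arguments mirror the snippet above):

import torch.nn as nn
from torchdyn.core import NeuralODE

vf = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 2))  # toy vector field

# backpropagate directly through the solver's operations (stores the full graph)
node_autograd = NeuralODE(vf, solver='dopri5', sensitivity='autograd')
# continuous adjoint: solve a backward ODE instead of storing the graph
node_adjoint = NeuralODE(vf, solver='dopri5', sensitivity='adjoint')
# adjoint using an interpolated forward trajectory (used in the reply below)
node_interp = NeuralODE(vf, solver='dopri5', sensitivity='interpolated_adjoint')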

One other note: I have temporarily gotten around this problem by manually changing the sensitivity of my NODE depending on the optimizer that I am using:

if optimizer_idx == 0:  # heldin
    self.decoder.sensalg = 'adjoint'
    self.decoder.sensitivity = 'adjoint'
if optimizer_idx == 1:  # heldout
    self.decoder.sensalg = 'autograd'
    self.decoder.sensitivity = 'autograd'

I suspect that this will be much more memory intensive than being able to compute the gradients with the adjoint.

Expected behavior

I expected the adjoint to be able to train model components that come before the NODE.

Thanks for your help!

massastrello commented 1 year ago

adjoint and interpolated_adjoint sensitivities should be able to back-propagate to the encoder parameters. Here is a working example using, e.g., interpolated_adjoint:

import torch
import torch.nn as nn
from torchdyn.core import NeuralDE

# encoder stack whose output is the NeuralDE initial condition
encoders = nn.ModuleList([
    nn.Linear(1, 2),
    nn.ReLU(),
    nn.Linear(2, 2),
    nn.ReLU(),
    nn.Linear(2, 1),
])
vector_field = nn.Sequential(
    nn.Linear(1, 2),
    nn.Tanh(),
    nn.Linear(2, 1),
)
node = NeuralDE(vector_field, solver='dopri5', sensitivity='interpolated_adjoint')
model = nn.Sequential(*encoders, node)
# optimize the encoder parameters only
opt = torch.optim.Adam(encoders.parameters(), lr=1e-3)
def loss_fn(y, y0): return torch.mean((y - y0)**2)

x = torch.randn(100, 1)
y = torch.randn(100, 1)
_, yh = model(x)

loss = loss_fn(yh, y)
loss.backward()
opt.step()
for p in encoders.parameters():
    print(p.grad)
opt.zero_grad()
## Your vector field callable (nn.Module) should have both time `t` and state `x` as arguments, we've wrapped it for you.
## tensor([[0.0128],
##         [0.0199]])
## tensor([0.0251, 0.0210])
## tensor([[0.2461, 0.2011],
##         [0.0000, 0.0000]])
## tensor([0.3502, 0.0000])
## tensor([[0.5597, 0.0000]])
## tensor([0.8893])

Do you have a sample code of the specific class of encoders you are using? How is the computed initial condition passed to the node?

cversteeg commented 1 year ago

Thanks for the quick response! Here is my encoder network:

self.encoder = nn.GRU(
    input_size=heldin_size,
    hidden_size=encoder_size,
    batch_first=True,
    bidirectional=True,
)
# Instantiate linear mapping to initial conditions
self.ic_linear = nn.Linear(2 * encoder_size, latent_size)
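
For a single-layer bidirectional GRU, h_n stacks the forward and backward final hidden states along dim 0, which is why ic_linear takes 2 * encoder_size input features; a small shape check, with heldin_size=8, encoder_size=16, latent_size=10 assumed purely for illustration:

import torch
import torch.nn as nn

# assumed sizes for illustration only: heldin_size=8, encoder_size=16, latent_size=10
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)
x = torch.randn(4, 50, 8)         # (batch, time, heldin_size)
_, h_n = gru(x)                   # h_n: (2, batch, 16) -- forward and backward states
h_cat = torch.cat([*h_n], -1)     # (batch, 32) == (batch, 2 * encoder_size)
ic = nn.Linear(32, 10)(h_cat)     # (batch, latent_size)
print(h_n.shape, h_cat.shape, ic.shape)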

Here is my neural ODE network (with hyperparameters set in my model constructor):

vector_field = []
vector_field.append(nn.Linear(latent_size, vf_hidden_size))
vector_field.append(act_func())
vector_field.append(nn.LayerNorm(vf_hidden_size))
for k in range(self.hparams.vf_num_layers - 1):
    vector_field.append(nn.Linear(vf_hidden_size, vf_hidden_size))
    vector_field.append(act_func())
    vector_field.append(nn.LayerNorm(vf_hidden_size))

vector_field.append(nn.Linear(vf_hidden_size, latent_size))

vector_field_net = nn.Sequential(*vector_field)

# Define the NeuralODE decoder and readout network
self.train_method = train_method
self.solver = solver
self.decoder = NeuralODE(
    vector_field_net,
    sensitivity=self.train_method,
    solver=self.solver,
    solver_adjoint=self.solver,
)

And here is the forward pass of my network:

_, h_n = self.encoder(data)
# Combine output from fwd and bwd encoders
h_n = torch.cat([*h_n], -1)
# Compute initial condition with dropout
h_n_drop = self.dropout(h_n)
ic = self.ic_linear(h_n_drop)
ic_drop = self.dropout(ic)
_, latents = self.decoder(ic_drop, t_span)

Perhaps it's because I don't wrap them all together in an nn.Sequential object?

cversteeg commented 1 year ago

Update:

import torch
import torch.nn as nn
from torchdyn.core import NeuralDE

n_batch = 10
n_time = 100
n_neuron = 3

class test_model(nn.Module):
    def __init__(self, n_time, n_neuron):
        super().__init__()
        self.n_time = n_time
        self.n_neuron = n_neuron
        self.encoder = nn.GRU(input_size=self.n_neuron, hidden_size=25, batch_first=True, bidirectional=False)
        self.ic_map = nn.Linear(25, n_neuron)
        vector_field = nn.Sequential(
            nn.Linear(n_neuron, 2),
            nn.Tanh(),
            nn.Linear(2, n_neuron),
        )
        self.node = NeuralDE(vector_field, solver='dopri5', sensitivity='adjoint')

    def forward(self, x):
        tspan = torch.linspace(0, 1, self.n_time)
        _, encs = self.encoder(x)
        ics = self.ic_map(torch.squeeze(encs))
        _, latents = self.node(ics, tspan)
        latents = latents.transpose(0, 1)
        return latents

model = test_model(n_time = n_time, n_neuron=n_neuron)
opt = torch.optim.Adam(model.encoder.parameters(), lr=1e-3)
def loss_fn(y, y0): return torch.mean((y - y0)**2)

x = torch.randn(n_batch, n_time, n_neuron)
y = torch.randn(n_batch, n_time, n_neuron)
yh = model(x)

yh = torch.squeeze(yh)
loss = loss_fn(yh, y)
loss.backward()
opt.step()
for p in model.encoder.parameters():
    print(p.grad)
opt.zero_grad()

With your example as a template, I found that this model seems to work fine. I'm doing some gradient manipulation (only using some samples to train) which might be causing some weirdness. I'll hammer away at the difference between this code and mine and see what the issue is.
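
A hypothetical sketch of the kind of per-sample masking described above, reusing the variables from the script; heldin_mask and the half/half split are assumptions, not code from the original experiments:

# hypothetical masking: train on only some samples, as described above
heldin_mask = torch.zeros(n_batch, dtype=torch.bool)
heldin_mask[: n_batch // 2] = True                          # assumed split
heldin_loss = loss_fn(yh[heldin_mask], y[heldin_mask])      # full-model loss
heldout_loss = loss_fn(yh[~heldin_mask], y[~heldin_mask])   # encoder-only loss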

cversteeg commented 1 year ago

Ok, I think I've figured out what the problem is, in case anyone ever has similar issues. I am using PyTorch Lightning to run my experiments, and there is an issue with its automatic optimization that prevents the adjoint from being used with multiple optimizers when self.automatic_optimization == True.

https://pytorch-lightning.readthedocs.io/en/latest/model/build_model_advanced.html

Manually implementing the optimization procedure allows training to work without issue. Thanks for the help in triangulating the issue!
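
A minimal sketch of the manual-optimization pattern the linked page describes, with the loss computations left as assumed placeholders:

import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # take over the optimization loop

    def training_step(self, batch, batch_idx):
        heldin_opt, heldout_opt = self.optimizers()

        # held-in pass: trains all parameters, adjoint works as usual
        heldin_opt.zero_grad()
        self.manual_backward(self.compute_heldin_loss(batch))   # assumed helper
        heldin_opt.step()

        # held-out pass: trains only the encoder-side parameters
        heldout_opt.zero_grad()
        self.manual_backward(self.compute_heldout_loss(batch))  # assumed helper
        heldout_opt.step()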

Zymrael commented 1 year ago

Closing this for now - thanks for the additional info!