DiffEqML / torchdyn

A PyTorch library entirely dedicated to neural differential equations, implicit models and related numerical methods
https://torchdyn.org
Apache License 2.0

NeuralODE adjoint incompatible with PyTorch Lightning gpus #116

Closed Bawaw closed 3 years ago

Bawaw commented 3 years ago

Hi torchdyn team,

Good job maintaining this library, I really like the new v1.0 release!

However, I think I've encountered an issue while playing with the library. It happens when I run the following example code:

import torch
import pytorch_lightning as pl
from torchdyn.core import NeuralODE

class Learner(pl.LightningModule):
    def __init__(self, model:torch.nn.Module):
        super().__init__()
        self.model = model

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        print()
        print("Learner is on device: {}".format(self.device))
        print("Model is on device: {}".format(self.model.device))
        print("Model is on device: {}".format(self.model.vf_params.device))
        print()

        x = batch[0]
        _, z = self.model(x, torch.linspace(0, 1, 100))
        loss = z.abs().mean()
        return {'loss': loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.model.parameters(), lr=0.01)

    def train_dataloader(self):
        dataset = torch.utils.data.TensorDataset(torch.randn(10, 1))
        return torch.utils.data.DataLoader(dataset)

f = torch.nn.Sequential(
        torch.nn.Linear(1, 16),
        torch.nn.Tanh(),
        torch.nn.Linear(16, 1)
    )

model = NeuralODE(f, sensitivity='adjoint')
learn = Learner(model)
trainer = pl.Trainer(min_epochs=1, max_epochs=5, gpus=1)
trainer.fit(learn)

I get the following error:

"/path/python3.9/site-packages/torch/autograd/__init__.py", line 147, in backwar
    Variable._execution_engine.run_backward(
RuntimeError: Function _ODEProblemFuncBackward returned an invalid gradient at index 0 - expected type TensorOptions(dtype=float, device=cpu, layout=Strided, 
lopt)) but got TensorOptions(dtype=float, device=cuda:0, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)

I think it might be related to NeuralODE.vf_params, which is still on the CPU during training. Is this a known issue? By the way, both sensitivity='autograd' and model = NeuralODE(f, sensitivity='adjoint').to(gpu) work fine.
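
For now, my workaround is to move the NeuralODE onto the GPU explicitly before handing it to Lightning. A minimal sketch, reusing f and Learner from the example above; the cuda:0 device index is an assumption matching gpus=1:

device = torch.device('cuda:0')  # assumed single-GPU setup

# Moving the model (and therefore vf_params) onto the GPU up front avoids
# the cpu/cuda mismatch in the adjoint backward pass.
model = NeuralODE(f, sensitivity='adjoint').to(device)
learn = Learner(model)
trainer = pl.Trainer(min_epochs=1, max_epochs=5, gpus=1)
trainer.fit(learn)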

I'm on Linux (Ubuntu 18.04.5 LTS). Library versions:

Python 3.9.6
torch 1.9.0
pytorch_lightning 1.4.0
torchdyn 1.0

Best

massastrello commented 3 years ago

The issue was caused by how we created the autograd function in ODEProblem (and thus inherited by NeuralODE). The problem should now be fixed. Could you please try running your code again?
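
For anyone hitting this on older versions: the error comes from the general contract of torch.autograd.Function, where backward must return gradients with the same dtype and device as the corresponding inputs seen by forward. A minimal sketch of that contract (illustrative only, not the actual torchdyn code; Scale is a made-up example):

import torch

class Scale(torch.autograd.Function):
    """Hypothetical example: multiply the input by a constant factor."""

    @staticmethod
    def forward(ctx, x, factor):
        ctx.factor = factor
        return x * factor

    @staticmethod
    def backward(ctx, grad_output):
        # The gradient returned for x must match x's dtype and device.
        # Deriving it from grad_output (which already lives on that device)
        # keeps them consistent; building it on a different device raises
        # "returned an invalid gradient at index 0".
        return grad_output * ctx.factor, None

x = torch.randn(3, requires_grad=True)
Scale.apply(x, 2.0).sum().backward()
print(x.grad)  # tensor([2., 2., 2.])

In the report above, the flattened parameters stayed on the CPU while the adjoint pass produced CUDA gradients, which is exactly the mismatch this contract forbids.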

Bawaw commented 3 years ago

Works like a charm, thank you for the quick response!