DiffEqML / torchdyn

A PyTorch library entirely dedicated to neural differential equations, implicit models and related numerical methods
https://torchdyn.org
Apache License 2.0

Cannot explain Shape Mismatch #120

Open MaxH1996 opened 3 years ago

MaxH1996 commented 3 years ago

Hi, I am currently working with the torchdyn package and I am getting an error that I cannot really explain:

File "/home/maxh/miniconda3/envs/deepqmc/lib/python3.8/site-packages/torch/autograd/function.py", line 87, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore[attr-defined]
  File "/home/maxh/miniconda3/envs/deepqmc/lib/python3.8/site-packages/torchdyn/numerics/sensitivity.py", line 152, in backward
    t_adj_sol, A = odeint(adjoint_dynamics, A, t_span[i - 1:i + 1].flip(0), solver, atol=atol, rtol=rtol)
  File "/home/maxh/miniconda3/envs/deepqmc/lib/python3.8/site-packages/torchdyn/numerics/odeint.py", line 87, in odeint
    dt = init_step(f, k1, x, t, solver.order, atol, rtol)
  File "/home/maxh/miniconda3/envs/deepqmc/lib/python3.8/site-packages/torchdyn/numerics/utils.py", line 39, in init_step
    d0, d1 = hairer_norm(x0 / scale), hairer_norm(f0 / scale)
RuntimeError: The size of tensor a (1203142) must match the size of tensor b (1206497) at non-singleton dimension 0

I know this error is specific to my particular code and usage of torchdyn, but mainly I am interested in why this mismatch occurs. The shapes of x0 and f0 that I pass in are both [8000, 3], so I do not understand how I can end up with tensors of size (1203142) or (1206497). It appears to happen during the backward pass, because a plain forward pass runs without any errors.

Do you maybe have any idea why this would occur?
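For readers following along, the error surfaces in torchdyn's step-size initializer rather than in the model itself. A minimal sketch of the failure mode (the `hairer_norm` here is an assumed RMS-style norm in the spirit of Hairer et al., not torchdyn's exact implementation):

```python
import torch

# Assumed stand-in for torchdyn's hairer_norm: an RMS-style norm.
def hairer_norm(t):
    return t.abs().pow(2).mean().sqrt()

atol, rtol = 1e-6, 1e-3
x0 = torch.randn(1203142)   # flattened adjoint state
f0 = torch.randn(1206497)   # adjoint dynamics output with extra elements
scale = atol + x0.abs() * rtol

d0 = hairer_norm(x0 / scale)    # fine: shapes match
try:
    d1 = hairer_norm(f0 / scale)
except RuntimeError as e:
    # same broadcast failure as in the traceback above
    print(e)
```

The division `f0 / scale` is where the broadcast fails, which is why the traceback points at `init_step` even though the root cause is upstream, in how the two flattened vectors were built.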

Zymrael commented 3 years ago

This error is happening while solving the adjoint dynamics for your net. The key lines are 47 onwards

xT, λT, μT = sol[-1], grad_output[-1][-1], torch.zeros_like(vf_params)

which are then concatenated and flattened into a single vector. Does that concatenation come out to (1203142) or (1206497) for your specific network architecture? The error also appears to be happening at your init step (see line 39 of utils.py in the traceback).

Could you share (at a high level) what your f is?
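For context, the size of the flattened adjoint state is the sum of the element counts of xT, λT, and μT. A minimal sketch of that bookkeeping (the `nn.Linear` stand-in and the exact concatenation are assumptions for illustration, not torchdyn's actual code):

```python
import torch
import torch.nn as nn

net = nn.Linear(3, 3)                # stand-in vector field
xT = torch.randn(8000, 3)            # final state
lamT = torch.randn_like(xT)          # dL/dxT
vf_params = torch.cat([p.flatten() for p in net.parameters()])
muT = torch.zeros_like(vf_params)    # parameter adjoint, initialized to zero

# Flattened adjoint state integrated backward by the solver.
A = torch.cat([xT.flatten(), lamT.flatten(), muT.flatten()])
print(A.numel())  # 8000*3 + 8000*3 + parameter count of net
```

If the adjoint dynamics return a vector whose parameter portion counts a different set of parameters than `vf_params`, the two sizes disagree and the broadcast in `init_step` fails exactly as in the traceback.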

MaxH1996 commented 3 years ago

Thanks for your quick response! Sharing my full f is difficult because there is a lot going on, but here is the actual class that I call:


class Func(nn.Module):

    def __init__(
        self,
        nuc,
        up,
        down,
        neural_net=Net
    ):
        super().__init__()
        # instantiate the class passed as neural_net (not the global Net)
        self.net = neural_net(nuc, up, down)

    def forward(self, t, x, rn, batch_dim, n_elec):
        # reshape flat (batch*electrons, 3) coordinates to (batch, electrons, 3)
        x = x.reshape(batch_dim, n_elec, 3)
        _, _, x = self.net(x, rn)
        return x.reshape(batch_dim * n_elec, 3)

Not sure if that helps at all. I call NeuralODE, and Func is wrapped with functools.partial to bind the extra arguments. What I did see is that the mismatch is in f0: x0 has the correct shape at init_step (utils.py line 39), correct in the sense that it matches the variable scale.

I'd have to check exactly if the flattening and concatenation would match for my architecture, but I think those numbers would make sense.

Btw, if I use the normal odeint without the adjoint I do not get this problem.
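To make the partial-wrapping setup above concrete, here is a minimal self-contained sketch (the `nn.Linear` body and the bound values are assumptions standing in for the real Net):

```python
import functools
import torch
import torch.nn as nn

class Func(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(3, 3)  # stand-in for the real network

    def forward(self, t, x, batch_dim, n_elec):
        x = x.reshape(batch_dim, n_elec, 3)
        x = self.lin(x)
        return x.reshape(batch_dim * n_elec, 3)

f = Func()
# bind the extra arguments so the solver sees a plain f(t, x)
vf = functools.partial(f.forward, batch_dim=4, n_elec=2)
x0 = torch.randn(4 * 2, 3)
out = vf(torch.tensor(0.0), x0)
print(out.shape)  # torch.Size([8, 3])
```

One caveat worth flagging: a bare `functools.partial` is not an `nn.Module`, so depending on how it is handed to NeuralODE the wrapped parameters may be registered differently than the module itself would be, which is one place a parameter-count mismatch could creep in.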

Zymrael commented 3 years ago

Identifying what the difference 1206497 - 1203142 = 3355 represents in terms of elements is key here. The shape 1206497 is determined during initialization of the adjoint as a concat of

xT, λT, μT = sol[-1], grad_output[-1][-1], torch.zeros_like(vf_params)

whereas 1203142 is produced as the output of f_. My guess is that the difference comes from a set of parameters that is registered with vf (and thus counted during the adjoint's initialization) but is not included in self.vf_params.
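A quick way to hunt for the missing 3355 elements is to compare the total parameter count of the module handed to NeuralODE against the flattened parameter vector, and to list parameters by name to spot anything registered unexpectedly. A sketch (the `nn.Sequential` here is only a placeholder):

```python
import torch
import torch.nn as nn

def param_count(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 3))
flat = torch.cat([p.contiguous().flatten() for p in net.parameters()])
assert param_count(net) == flat.numel()

# Listing names helps spot submodules whose parameters are counted
# in one place but not the other:
for name, p in net.named_parameters():
    print(name, tuple(p.shape))
```

If some submodule (or buffer treated as a parameter) shows up here with exactly 3355 elements, that would pin down the discrepancy.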

MaxH1996 commented 3 years ago

This is basically the issue I have been trying to work out too (the 3355-element difference in parameters).

Another thing I wanted to ask: I use second derivatives in my neural net. Specifically, my self.Net uses Laplacians. Does this pose a problem for the adjoint method?
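For readers, second derivatives inside the vector field are not inherently incompatible with the adjoint, but every autograd call must stay differentiable (create_graph=True), since backpropagating through a Laplacian amounts to a higher-order derivative. A minimal sketch of an autograd Laplacian (the quadratic test function is an assumption for illustration):

```python
import torch

def laplacian(f, x):
    # trace of the Hessian of f, per sample, via repeated autograd.grad
    x = x.requires_grad_(True)
    y = f(x).sum()
    (grad,) = torch.autograd.grad(y, x, create_graph=True)
    lap = torch.zeros(x.shape[0])
    for i in range(x.shape[1]):
        # create_graph=True keeps the result differentiable, which the
        # adjoint's backward pass requires
        (g2,) = torch.autograd.grad(grad[:, i].sum(), x, create_graph=True)
        lap = lap + g2[:, i]
    return lap

x = torch.randn(5, 3)
# for f(z) = sum(z_i^2) the Laplacian is 2 * dim = 6 for every sample
print(laplacian(lambda z: (z ** 2).sum(dim=1), x))
```

If any grad call inside the net omits create_graph=True, the adjoint's backward pass can silently see a different computation graph than the forward pass did.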

MaxH1996 commented 3 years ago

Hey, I was wondering if you had any more thoughts on this issue. I didn't have time in the last couple of weeks to work on it, but I am coming back to it now and still experiencing this mismatch in shapes. I checked the locations you suggested the difference might come from, but the sizes are the same in both places.

Zymrael commented 2 years ago

I'd be happy to take a look at the model if you can share in private. To determine where the issue lies, I would only need access to the nn.Module that determines your input -> output map.

data-hound commented 8 months ago

Hi @Zymrael

I am encountering the same issue. Here is my network, along with the input shape, and how I am creating the NeuralODE:

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(32, 10, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(2)

    def forward(self, x):
        x = self.maxpool(self.relu(self.conv1(x)))
        x = self.maxpool(self.relu(self.conv2(x)))
        x = self.relu(self.conv3(x))
        print('here')
        print(x.shape)
        return x

model = NeuralODE(SimpleCNN())
#Your vector field callable (nn.Module) should have both time `t` and state `x` as arguments, we've wrapped it for you.

t_span = torch.linspace(0,1,100)
t_eval, trajectory = model(next(iter(train_loader))[0], t_span)
trajectory = trajectory.detach()
next(iter(train_loader))[0].shape
#torch.Size([64, 1, 32, 32])

The error message :

RuntimeError: The size of tensor a (8) must match the size of tensor b (32) at non-singleton dimension 3

Partial stack trace:

here
torch.Size([64, 10, 8, 8])
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-41-5705b2264547> in <cell line: 2>()
      1 t_span = torch.linspace(0,1,100)
----> 2 t_eval, trajectory = model(next(iter(train_loader))[0], t_span)
      3 trajectory = trajectory.detach()

6 frames
/usr/local/lib/python3.10/dist-packages/torchdyn/numerics/utils.py in init_step(f, f0, x0, t0, order, atol, rtol)
     37 def init_step(f, f0, x0, t0, order, atol, rtol):
     38     scale = atol + torch.abs(x0) * rtol
---> 39     d0, d1 = hairer_norm(x0 / scale), hairer_norm(f0 / scale)
     40 
     41     if d0 < 1e-5 or d1 < 1e-5:

RuntimeError: The size of tensor a (8) must match the size of tensor b (32) at non-singleton dimension 3
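For this second report, the printed shape is the giveaway: the vector field maps [64, 1, 32, 32] to [64, 10, 8, 8], but an ODE vector field must return dx/dt with exactly the same shape as x, since the solver adds f(x) * dt back onto x. The channel change plus the two maxpools break that invariant. A minimal shape-preserving variant (this architecture is only a sketch, not a drop-in replacement for SimpleCNN's intent):

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    def __init__(self, channels: int = 1):
        super().__init__()
        # padding=1 with kernel 3 and no pooling keeps spatial dims;
        # the last conv maps back to the input channel count
        self.conv1 = nn.Conv2d(channels, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, t, x):
        return self.conv2(self.relu(self.conv1(x)))  # same shape as x

f = ODEFunc()
x = torch.randn(64, 1, 32, 32)
print(f(torch.tensor(0.0), x).shape)  # torch.Size([64, 1, 32, 32])
```

If downsampling or a channel change is needed for the downstream task, it should happen outside the NeuralODE block, with the ODE operating on a fixed-shape state.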