NeuroDiffGym / neurodiffeq

A library for solving differential equations using neural networks based on PyTorch, used by multiple research groups around the world, including at Harvard IACS.
http://pypi.org/project/neurodiffeq/
MIT License

Performance of computing partial derivative #77

Open smao-astro opened 3 years ago

smao-astro commented 3 years ago

Hi,

I am pretty new to neurodiffeq; thank you very much for the excellent library.

I am interested in the way partial derivatives w.r.t. the inputs are computed, and in the computational cost of doing so.

Take a forward ODE solver (1D, one unknown variable) as an example: the input is x, a batch of coordinates, and the output of the neural network is y, the approximate solution of the ODE at these coordinates. If we view the neural network as a smooth function that simulates the solution and name it f, the forward part of training evaluates y = f(x); for each input element x_i the neural network gives y_i = f(x_i), where i runs from 0 to N-1 and N is the batch size. When constructing the loss function, one evaluates the residual of the ODE, which usually requires evaluating \frac{\partial y_i}{\partial x_i} and higher-order derivatives.
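
For concreteness, here is a minimal sketch of that setup (the network architecture, the example equation dy/dx + y = 0, and all names below are placeholders chosen for illustration, not neurodiffeq's actual code):

import torch
from torch import autograd

# hypothetical network standing in for f (architecture is arbitrary)
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)

N = 128
x = torch.rand(N, 1, requires_grad=True)   # batch of coordinates, shape (N, 1)
y = net(x)                                 # y_i = f(x_i), shape (N, 1)

# dy_i/dx_i for every sample, shape (N, 1)
dydx = autograd.grad(y, x, grad_outputs=torch.ones_like(y), create_graph=True)[0]

# residual of the example ODE dy/dx + y = 0; its mean square is the loss
residual = dydx + y
loss = (residual ** 2).mean()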

My question relates to the way \frac{\partial y_i}{\partial x_i} is evaluated. For example, x is an (N, 1) tensor and y is also an (N, 1) tensor, where N is the batch size. If you call autograd.grad(y, t, create_graph=True, grad_outputs=ones, allow_unused=True) as in the lines below https://github.com/odegym/neurodiffeq/blob/718f226d40cfbcb9ed50d72119bd3668b0c68733/neurodiffeq/neurodiffeq.py#L21-L22 my understanding is that it will evaluate a Jacobian matrix of size (N, N) with elements \frac{\partial y_i}{\partial x_j} (i, j from 0 to N-1), even though y_i only depends on x_i, and thus computation (and storage) on the non-diagonal elements is useless and unnecessary. In other words, the computation could actually be done by evaluating N gradients, but the current method does N * N.
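
To spell out the call pattern being discussed in a standalone form (net and x below are placeholder names; this is an illustration rather than a reproduction of the linked lines):

import torch
from torch import autograd

net = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
x = torch.rand(5, 1, requires_grad=True)  # shape (5, 1)
y = net(x)                                # shape (5, 1)
ones = torch.ones_like(y)

# the result has the same shape as x: one value dy_i/dx_i per sample;
# the question is whether an N x N Jacobian is ever formed internally
# in order to produce these N values
der = autograd.grad(y, x, create_graph=True, grad_outputs=ones, allow_unused=True)[0]
print(der.shape)  # torch.Size([5, 1])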

My questions are:

  1. Is what I stated above correct, to your understanding?
  2. If so, do you think it may affect computation speed?
  3. If 1 is correct, do you know of any way to reduce the computation needed, and do you have any plan to do so?

Thanks!

shuheng-liu commented 3 years ago

Thanks for your interest in neurodiffeq. Your question is very good and thought-provoking; I need to check whether it indeed computes an N x N Jacobian when there is no interdependence within the batch. My intuition is that you are right.

In the meantime, may I ask how you came to know neurodiffeq? It'd be nice to know how we can further promote it :)

smao-astro commented 3 years ago

Hi @shuheng-liu ,

Do you have any updates you would like to share on these questions? I am not familiar with automatic differentiation, so I would like to know whether the current implementation is efficient.

In addition, I noticed that this new API in the nightly builds might be helpful. What do you think?

I cannot remember the details, since I noticed it earlier last year, maybe in some paper(?)

Thanks.

shuheng-liu commented 3 years ago

computation (and storage) on the non-diagonal elements is useless and unnecessary

I am pretty sure it won't introduce additional storage because PyTorch accumulates (sums) the gradient w.r.t. the same variable. For example

import torch

x = torch.rand(10, 1, requires_grad=True)  # shape 10 x 1
y = torch.cat([x, x*2, x*3], 1)            # shape 10 x 3; every column depends on x
y.backward(torch.ones_like(y))             # gradients from all three columns accumulate into x.grad
print(x.grad)                              # 10 x 1 tensor filled with 1 + 2 + 3 = 6

will give you a 10 x 1 tensor filled with 6s, instead of a 10 x 3 tensor of (1, 2, 3) rows.

As for the computation, though, I think it will be more expensive, because PyTorch uses reverse-mode automatic differentiation, so when the output has a high dimension the computation slows down. I'm not sure whether this can be fixed without switching to a framework other than PyTorch.
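
One rough way to probe this empirically (a sketch only; the network, batch size, and naive timing below are arbitrary choices, not part of neurodiffeq) would be to compare the single batched grad_outputs=ones call against a loop of per-sample grad calls:

import time
import torch
from torch import autograd

net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
x = torch.rand(4096, 1, requires_grad=True)
y = net(x)

# one batched call: a single vector-Jacobian product with a vector of ones
start = time.time()
batched = autograd.grad(y, x, grad_outputs=torch.ones_like(y), retain_graph=True)[0]
print("batched call:", time.time() - start)

# N separate calls, one scalar output per sample (clearly N backward passes)
start = time.time()
for i in range(x.shape[0]):
    autograd.grad(y[i, 0], x, retain_graph=True)
print("per-sample loop:", time.time() - start)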

torch.vmap looks interesting but I don't have much experience with it. I'm still trying to understand it.
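
For reference, in later PyTorch releases the vmap prototype was folded into torch.func, where a per-sample derivative can be written as grad composed with vmap. A minimal sketch, assuming a recent PyTorch (>= 2.0) and a placeholder network rather than neurodiffeq's API:

import torch
from torch.func import grad, vmap

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

def f_single(x_single):
    # x_single has shape (1,); return a scalar so torch.func.grad applies
    return net(x_single).squeeze()

x = torch.rand(10, 1)
# vmap maps grad(f_single) over the batch dimension, giving dy_i/dx_i per sample
dydx = vmap(grad(f_single))(x)
print(dydx.shape)  # torch.Size([10, 1])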