alokwarey opened this issue 2 years ago

I am noticing that training just stalls/stops at a certain epoch, with no errors or explanation. Screenshot below. It got stuck at this epoch and has been sitting there for the last 2 hours, and I am not sure why.
This is likely due to PyTorch Lightning; could you raise min_epochs to 100 or so?
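A minimal sketch of that suggestion, assuming a standard PyTorch Lightning `Trainer` setup (the `max_epochs` value is just an illustrative placeholder):

```python
import pytorch_lightning as pl

# Keep training for at least 100 epochs before any early-stopping /
# minimum-epoch logic is allowed to end the run.
trainer = pl.Trainer(min_epochs=100, max_epochs=300)
trainer.fit(learner)  # `learner` is your existing LightningModule
```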
Tried that. No change. Training stalled at epoch 30.
Could you share your model (and ideally the args passed to NeuralODE during init, i.e. which solver you are using), or is the setup the same as the tutorial? This could be caused by an unlucky combination of adaptive solver, learning rate, and integration time. One quick way to check whether the problem is torchdyn- or PyTorch Lightning-related would be to replace the solver with a fixed-step alternative (use "euler" or "rk4") and see if training stalls again.
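A hedged sketch of that check, assuming the NeuralODE constructor and import path used in the torchdyn tutorials (the import location has moved between releases, so adjust as needed):

```python
from torchdyn.core import NeuralODE  # older releases: torchdyn.models

# f is your existing vector field (an nn.Module)
model_adaptive = NeuralODE(f, solver="tsit5")  # original adaptive setup
model_fixed = NeuralODE(f, solver="rk4")       # fixed-step alternative for the A/B check
```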
Using a fixed-step solver ("euler" or "rk4") works! I was using tsit5 with Adam(lr=0.01). I have a follow-up question: how can I solve an IVP or time-series problem that has control variables at each time step?
Instead of training a neural diffeq of the form dy/dt = f(y(t)), I want to train one of the form dy/dt = f(y(t), x(t)). How can I feed in x(t) or control variables at each time step for each mini-batch? Is there an example of a time series prediction problem with control variables?
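Not an official torchdyn pattern, but here is a minimal sketch of one way to do it: store the per-mini-batch control sequence on the vector-field module, look it up (piecewise-constant here) inside `forward`, and wrap the module in NeuralODE as usual. The class and method names are my own, and the `forward(t, y)` signature assumes a torchdyn version that passes the current time to the vector field.

```python
import torch
import torch.nn as nn

class ControlledVectorField(nn.Module):
    """dy/dt = f(y(t), x(t)) with an externally supplied control x(t)."""

    def __init__(self, state_dim, control_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + control_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, state_dim),
        )
        self.t_grid = None    # (T,) times at which the controls are sampled
        self.controls = None  # (batch, T, control_dim)

    def set_controls(self, t_grid, controls):
        # Call once per mini-batch, before integrating.
        self.t_grid, self.controls = t_grid, controls

    def forward(self, t, y):
        # Piecewise-constant interpolation: use the control at the last
        # grid point not exceeding the solver's current time t.
        idx = int((self.t_grid <= t).sum().item()) - 1
        idx = max(idx, 0)
        x_t = self.controls[:, idx, :]               # (batch, control_dim)
        return self.net(torch.cat([y, x_t], dim=-1))
```

In the training loop one would call `vf.set_controls(t_grid, x_batch)` before each solve; a smoother scheme (linear or spline interpolation of x) drops in by replacing the lookup in `forward`.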
I've also encountered an issue similar to @alokwarey's; changing to a fixed-step integrator seems to alleviate it, but as far as I'm aware I can't change the parameters associated with those solvers (e.g. min/max step size).
On a quasi-related note, is it possible to print/log diagnostics for the stiffness of the problem, perhaps between training steps? Since torchdyn has PyTorch Lightning awareness, one could just expose metrics at a higher level for users to use with PL logging functionality. If this is desirable I could take a crack at it?
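Until something like that exists in the library, here is a rough DIY sketch: log the number of function evaluations (NFE) per solve through Lightning's `self.log`. The `nfe` counter is hand-rolled (incremented at the top of the vector field's `forward` via `self.nfe += 1`), not a torchdyn built-in, and the `(t_eval, trajectory)` return signature assumes a recent torchdyn release.

```python
import pytorch_lightning as pl
import torch
import torch.nn.functional as F

class Learner(pl.LightningModule):
    def __init__(self, neural_ode, vf):
        super().__init__()
        self.model, self.vf = neural_ode, vf
        self.t_span = torch.linspace(0, 1, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        self.vf.nfe = 0                         # reset hand-rolled counter
        t_eval, traj = self.model(x, self.t_span)
        loss = F.mse_loss(traj[-1], y)
        # Adaptive solvers take more (and smaller) steps on stiff dynamics,
        # so function evaluations per solve are a cheap stiffness proxy.
        self.log("train/nfe", float(self.vf.nfe), prog_bar=True)
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```

Step-size statistics would need hooks deeper inside the solver loop, which is exactly where a library-level feature would be more useful than this workaround.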