SciML / DiffEqFlux.jl

Pre-built implicit layer architectures with O(1) backprop, GPUs, and stiff+non-stiff DE solvers, demonstrating scientific machine learning (SciML) and physics-informed machine learning methods
https://docs.sciml.ai/DiffEqFlux/stable
MIT License

maxiters: only works with optimization algs in sciml_train? #598

Closed · stephans3 closed this issue 3 years ago

stephans3 commented 3 years ago

I ran the Lotka-Volterra example (optimization_ode) and tested sciml_train with several parameter combinations: with and without an explicit optimization algorithm, and with and without maxiters.

I found out that:

- maxiters seems to be respected only when I pass an optimization algorithm explicitly; with the default optimizer, the callback was called more often than maxiters times.
I noticed the same in the heat equation / PDE example (pde_constrained) with sciml_train(loss, ps, cb = cb, maxiters = 100): I got 108 callback calls, and the optimization stopped at a loss below 10^(-20), probably due to an internal stopping criterion.
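For concreteness, my test looked roughly like this. This is a minimal sketch along the lines of the tutorial, not the exact example code; the initial guess and the callback counter are mine:

```julia
using DiffEqFlux, OrdinaryDiffEq, Optim

function lotka_volterra!(du, u, p, t)
    x, y = u
    α, β, δ, γ = p
    du[1] = α * x - β * x * y
    du[2] = -δ * y + γ * x * y
end

u0 = [1.0, 1.0]
tspan = (0.0, 10.0)
p_true = [1.5, 1.0, 3.0, 1.0]
prob = ODEProblem(lotka_volterra!, u0, tspan, p_true)
data = Array(solve(prob, Tsit5(), saveat = 0.1))  # synthetic target data

function loss(p)
    sol = solve(prob, Tsit5(), p = p, saveat = 0.1)
    sum(abs2, Array(sol) .- data), sol
end

n_cb = Ref(0)          # count how often the callback fires
cb = function (p, l, sol)
    n_cb[] += 1
    return false       # false = do not halt the optimization
end

res = DiffEqFlux.sciml_train(loss, p_true .+ 0.1, BFGS(); cb = cb, maxiters = 100)
@show n_cb[]           # with BFGS this can come out above maxiters
```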

Therefore my questions are:

- Why does the number of callback calls exceed maxiters?
- Does maxiters only take effect when an optimization algorithm is passed explicitly?
My configuration:

ChrisRackauckas commented 3 years ago

BFGS can do multiple f calls per step. That's an Optim thing.
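You can see this with plain Optim.jl directly. A quick sketch; the Rosenbrock function is just a stand-in objective:

```julia
using Optim

# Stand-in objective; any smooth function works here.
rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

res = optimize(rosenbrock, zeros(2), BFGS(), Optim.Options(iterations = 100))
@show Optim.iterations(res)  # optimizer steps taken
@show Optim.f_calls(res)     # objective evaluations: typically more than the steps
```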

stephans3 commented 3 years ago

Is it known how many f calls BFGS makes per iteration?

So if I use BFGS, does it always hold that the number of cb calls <= 2 x maxiters?

Or could the number of cb calls be even higher than 2 x maxiters, meaning I cannot determine the number of cb calls before execution? Then I could not estimate how long my optimization will take.

At first glance, I could not find anything about this in either the GalacticOptim docs or the Optim docs.

Besides that, I read about the default optimizer choice in sciml_train: "By default, if the loss function is deterministic, then an optimizer chain of ADAM -> BFGS is used [...]"

Does it run interleaved, like this:

ADAM (1 iteration) -> BFGS (1 iteration or more) -> ADAM (1 iteration) -> BFGS (1 iteration or more) -> ...

or in sequence:

ADAM (100 iterations) -> BFGS (100 iterations or more)?

ChrisRackauckas commented 3 years ago

Is it known how many f calls BFGS makes per iteration?

That depends on the chosen line search algorithm and the stability of the problem. BFGS with a line search can reject steps when the line search fails, for example.
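The line search is also something you can pick, which changes the f-call count per step. A sketch with LineSearches.jl, reusing the rosenbrock stand-in from above (BackTracking is just an example choice):

```julia
using Optim, LineSearches

res_hz = optimize(rosenbrock, zeros(2), BFGS())  # default HagerZhang line search
res_bt = optimize(rosenbrock, zeros(2), BFGS(linesearch = BackTracking()))
@show Optim.f_calls(res_hz)
@show Optim.f_calls(res_bt)  # different line searches, different f-call counts
```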

Or could the number of cb calls be even higher than 2 x maxiters, meaning I cannot determine the number of cb calls before execution? Then I could not estimate how long my optimization will take.

Iterations != f calls in general: Newton-type methods need Hessian evaluations, line searches add extra f calls, and so on. This is generally true of optimization algorithms; only simple first-order methods make one f call per iteration. If you need that kind of cost estimate, you may need to stick to the first-order "machine learning" optimizers and put a maximum iteration count on them.
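For example, reusing loss, cb, and n_cb from the sketch in the first post (assuming the same GalacticOptim-era API), a first-order optimizer makes the cost predictable up front:

```julia
n_cb[] = 0  # reset the counter from the earlier sketch
res = DiffEqFlux.sciml_train(loss, p_true .+ 0.1, ADAM(0.01); cb = cb, maxiters = 100)
@show n_cb[]  # one f call per ADAM step, so this should match maxiters
```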

ADAM (100 iterations) -> BFGS (100 iterations or more)?

It runs like the latter. In fact, the BFGS stage doesn't have an iteration bound and just runs to convergence.
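Written out explicitly, the default chain is roughly the following sketch (reusing names from the earlier sketches; .minimizer is the result field from that era's API):

```julia
# ADAM runs for its full iteration budget, then BFGS starts from ADAM's
# result and, with no maxiters given, runs to convergence.
res_adam = DiffEqFlux.sciml_train(loss, p_true .+ 0.1, ADAM(0.01); cb = cb, maxiters = 100)
res_bfgs = DiffEqFlux.sciml_train(loss, res_adam.minimizer, BFGS(); cb = cb)
```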