Closed stephans3 closed 3 years ago
BFGS can do multiple f calls per step. That's an Optim thing.
Is it clear how many multiple f calls BFGS uses?
So if I use BFGS, is it always true that number of cb calls <= 2 x maxiters?
Or could it be that the number of cb calls is even higher (than 2 times maxiters) which means that I cannot determine the number of cb calls before execution? Then I could not estimate how much time my optimization takes.
At first glance, I could not find anything about this in either the GalacticOptim docs or the Optim docs.
Besides that, I read about the default optimizer choice in sciml_train: "By default, if the loss function is deterministic than an optimizer chain of ADAM -> BFGS is used [...]"
Does it run like this: ADAM (1 iteration) -> BFGS (1 iteration or more) -> ADAM (1 iteration) -> BFGS (1 iteration or more) -> ...
or
ADAM (100 iterations) -> BFGS (100 iterations or more) ??
Is it clear how many multiple f calls BFGS uses?
That is dependent on the chosen line search algorithm and the stability of the problem. BFGS with line search can reject steps if the line search fails and such.
Or could it be that the number of cb calls is even higher (than 2 times maxiters) which means that I cannot determine the number of cb calls before execution? Then I could not estimate how much time my optimization takes.
Iterations != f calls in general. Newton methods need Hessian and such, line searches are extra f calls, etc. This is just generally something true about optimization algorithms. Only simple first order methods have one f call per iteration. If you need something with that kind of cost estimate, then you may need to stick to the first order "machine learning" optimizers and stick a maximum iteration count on there.
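To make the iterations-versus-f-calls distinction concrete, here is a minimal sketch in plain Julia. It is not how Optim implements BFGS: it is a toy gradient descent with a backtracking (Armijo) line search, and the objective, the `counted` wrapper, and the `descend` function are all names made up for this illustration. The point is only that the number of objective evaluations depends on how often the line search has to shrink the step, so it cannot be read off the iteration count.

```julia
# Toy objective (an assumption for illustration): an ill-conditioned quadratic.
objective(x) = x[1]^2 + 10 * x[2]^2
gradient(x) = [2 * x[1], 20 * x[2]]

# Wrap a function so that every evaluation increments a counter.
function counted(f)
    calls = Ref(0)
    wrapped = x -> (calls[] += 1; f(x))
    return wrapped, calls
end

# Gradient descent with a backtracking (Armijo) line search.
# Every iteration evaluates f at least once; rejected trial steps cost extra calls.
function descend(f, grad, x0; iters = 10, alpha0 = 1.0, shrink = 0.5,
                 c = 1e-4, max_backtracks = 50)
    x = copy(x0)
    fx = f(x)                              # one call before the first iteration
    for _ in 1:iters
        g = grad(x)
        alpha = alpha0
        for _ in 1:max_backtracks
            trial = x .- alpha .* g
            ftrial = f(trial)              # one call per trial step
            if ftrial <= fx - c * alpha * sum(abs2, g)
                x, fx = trial, ftrial      # sufficient decrease: accept the step
                break
            end
            alpha *= shrink                # otherwise shrink the step and retry
        end
        # If no trial step was accepted, x is left unchanged (the step is rejected).
    end
    return x
end

f_counted, calls = counted(objective)
iters = 10
descend(f_counted, gradient, [1.0, -2.0]; iters = iters)
println("iterations = ", iters, ", f calls = ", calls[])
```

With this setup the first full step overshoots and is rejected, so the call count is strictly larger than the iteration count. How much larger depends on the problem and the line search parameters, which is why no fixed multiple of maxiters can be guaranteed.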
ADAM (100 iterations) -> BFGS (100 iterations or more) ??
Runs like this. In fact, the latter doesn't have an iteration bound and just runs to convergence.
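If a hard cap on the number of callback calls is needed regardless of the optimizer, one option is to let the callback itself enforce it. The sketch below assumes the convention used in the DiffEqFlux examples, where the callback receives `(p, l, ...)` and returning `true` halts the optimization. `budgeted_callback` and `toy_optimize` are hypothetical names; the driver loop is only a stand-in so the snippet runs without any packages, and it is not the sciml_train implementation.

```julia
# Build a callback that counts its own invocations and asks for a stop
# once `budget` calls have been made (returning `true` means "halt").
function budgeted_callback(budget::Int)
    ncalls = Ref(0)
    cb = function (p, l, args...)
        ncalls[] += 1
        return ncalls[] >= budget
    end
    return cb, ncalls
end

# Stand-in driver (hypothetical): calls cb after each step and
# stops as soon as cb returns true.
function toy_optimize(cb; maxsteps = 1_000)
    p = [1.0]
    for _ in 1:maxsteps
        p = 0.9 .* p
        cb(p, sum(abs2, p)) && break
    end
    return p
end

cb, ncalls = budgeted_callback(150)
toy_optimize(cb)
println("callback calls: ", ncalls[])
```

Under the stated assumption, the same `cb` could be passed as `sciml_train(loss, p, cb = cb)`, and `ncalls[]` read afterwards to see how many callback calls actually happened.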
I ran the Lotka-Volterra example (optimization_ode) and tested sciml_train with some parameter combinations: with/without optimization algorithm and maxiters. I found out that:
- for sciml_train(loss, p, cb = callback): I got 1302 callback calls
- for sciml_train(loss, p, cb = callback, maxiters = 100): I got 202 callback calls
- for sciml_train(loss, p, ADAM(0.1), cb = callback, maxiters=100): I got 101 callback calls: maxiters is equal to the number of callback calls as expected :+1:
I noticed the same in the heat equation / PDE example (pde_constrained) with sciml_train(loss, ps, cb = cb, maxiters = 100). I got 108 callback calls and the optimization stopped at a loss below 10^(-20): probably the internal stopping criterion.
Therefore my questions are:
- [...] maxiters equal to the number of callback calls?
- [...] maxiters work?
My configuration: