Open stevengj opened 1 month ago
Agreed (and I didn't choose those settings). My favorite criterion is abs(f(x+Δx) - f(x)) < f_rel_tol * (abs(f(x + Δx)) + abs(f(x))
, as it is fully scale-invariant. Unfortunately, it can be unsatisfying for functions that achieve f(zeros(d)) == 0
at the minimum: due to the compression of floating-point numbers near zero, it will iterate until you start getting into the subnormals. Short of an absolute tolerance on f
(which is sensitive to the scale of f
) or a tolerance on x
or g
(both of which are dimensionful), I'm not aware of a good solution.
Yeah, in practice many people don't realize that problem-specific tuning of tolerances is something you should almost always do.
If you're going to have a default, it should be relative tolerances, even though these can be insufficient upon occasion as well (or default to no tolerances, and throw an error if the user doesn't choose a stopping criterion). Maybe you can't change this now for backwards-compatibility reasons.
But in almost all of the manual's examples, no tolerance is passed, which is misleading. This, at least, can be fixed. And I wish there was a built-in way to set the gradient tolerance relative to the initial gradient.
Thank you for this discussion, and I agree with everything said so far. I don't remember the history of the default tolerance, and it doesn't matter if it was me or someone else.
I think this is actually a surprisingly hard truth about numerical optimization, that I also have difficulty handling in my profession life. I mean, when doing work myself, but also when communicating with customers.
Users and customers want to know if the procedure converged, but it's actually quite hard to say anything meaningfully about in general, because we have no natural scale to accept "0" at. Not converged? Just multiply the objective with 0.0001 and suddenly you "converged".
A relative tolerance can be hard on optimization runs that actually start close to the solution but has some level of noise in the objective calculations (say an embedded dynamical system that is or can only be solved up to some finite tolerance).
So, just ask the user, right? The problem is many users also have no idea what to set these values at, so depending on the use case they may set it to a level that actually converges such that they can show that their model is "good".
A bad default is not necessarily better than no default - that's not what I'm trying to argue - but no default is also very hard on some users. Your relative tolerance may be better here, because it says that you want a certain improvement over the starting point, but from my own experience in nonlinear mixed effects modeling you can often start with gradient norms that are 1e10 or 1e11 and you would then stop at 1e3 which again from my experience is "not converged". Then what, you restart again? Now you require 1e-5 which may not be attainable in practise.
I accept your comment that we can never absolutely say that the largest gradient is small enough such that we can confidently say we converged. The most accurate thing we can promise is to say we started at a point where the function was "this steep" and we ended at a place where it was "this much flatter" - or alternatively just not stop by the gradient norm at all, and just let the algorithm run forever (or until maxiter) until the algorithm can no longer make progress and stop. Then, the user has to decide if they're happy (however they want to do that).
You set an absolute tolerance of
g_tol = 1e-8
by default. I don't know how you picked this — it depends on the scaling of the problem, because gradients are dimensionful.This implicitly assumes that the function is scaled so that the typical derivatives are of order unity. This makes it pretty dangerous to use the default tolerances, in my opinion.
A safer default would have been to set a relative tolerance
g_reltol
on the gradient, e.g. by settingg_tol = g_reltol * initial_g
.At the very least, the manual should highlight this implicit scaling assumption, and let people know that they are likely to need a problem-specific
g_tol
. (Also for gradient-free methods, whereg_tol
means who-knows-what.)