Closed: Cole-Monnahan-NOAA closed this issue 3 years ago.
Nice work @Cole-Monnahan-NOAA! I can't speak to the implementation questions, but as an ADMB user I would be happy to have this option available, even with no further improvements over what you've done so far.
Here are two questions I have about the performance of this option that could be tested using your fork. For example, could similar improvements be achieved simply by tightening the convergence criterion (via -crit N or setting convergence_criteria in the code based on some user setting)?
Regarding your list of things to consider, here are my thoughts:
Even if I set the criterion to something like 1E-10 the optimizer never gets there, even with maxfn set very high. I'm not sure why it kicks out before attaining that; I suspect there is a hard-coded internal criterion. Regardless, in some (many?) cases this hess_step will not be needed. But when there are large differences in marginal scale and high correlations among parameters, the optimizer converges very slowly to the mode. That is the use case for this feature. To be clear, it is expensive because it requires calculating and inverting the Hessian again (and calculating the SD report stuff).
From what I've seen so far, there is no change to the inference of a model in either the parameters or the covariance matrix, and I wouldn't expect one: the updates to the parameters are minuscule. The point is more related to convergence checks. If your model can't get below, say, maxgrad 0.001 and you try one of these steps and it fails miserably, then I would be highly suspicious that the optimizer is at a mode. I feel like in the stock assessment world we are always wondering about convergence. In my opinion small gradients and invertible Hessians are not reliable proof of convergence (I've seen counterexamples). I contend this hess_step approach is possibly the most reliable evidence of convergence: if these steps get you arbitrarily small gradients, it means the inverse Hessian has the "right" geometry and the mode is real. That's my interpretation at least.
For now what I did is a really simple proof of concept. If there is interest I or someone else could explore this on a wider set of models to see what happens, like those with things like MLEs on bounds or other poor parameterization situations.
I agree that this seems useful! In my opinion, using Newton steps (approximating the curvature around the MLE as if it were quadratic) to improve the gradient is a good corroboration that the log-likelihood really is quadratic in the vicinity of the MLE (as assumed by asymptotic standard error calculations).
Perhaps more importantly, I remember the PFMC Terms of Reference for stock assessments requiring that assessors list the maximum absolute final gradient (presumably as a "sanity check") and there was some skepticism about models with a max-abs-gradient >0.01, but where this could happen in otherwise reasonable models due to complexities induced by highly nonlinear selectivity options, etc. So in practice, having this option would allow assessors to eliminate that additional source of stress during assessment review panels, e.g., eliminate a case where a model had a max-abs-gradient around 0.001 prior to STAR panel, but where it gets worse due to changes during the STAR panel.
I think that the sablefish stock assessment model would be an ideal place to test this because we had trouble with convergence along the way to a base model. Reviewers were curious why the model wasn't converging when minor changes were made. It seemed unpredictable what was going to throw the model off. I wonder if the actual base model would pass the "new" test. I am busy with hake for the next two months, but I am interested in helping test the method, if it can wait until March.
If you're able to share the files I can test it on my end. I'll email separately to coordinate.
Thanks to @k-doering-NOAA for running this test on a suite of stock synthesis models. The results were pretty good. 21/30 models improved, many by multiple orders of magnitude. 6/30 resulted in a worse maximum gradient. Investigation of one of these found a parameter estimated on the bound (natural mortality), and when that parameter was fixed the hess_step worked. The remaining 3/30 crashed for unknown reasons which appear to depend on OS and need further investigation.
So I think the evidence is pretty strong that this feature could be useful. I see no downsides to including it, as it has no impact on any other part of the code and should not break backwards compatibility. I therefore propose merging it in to the next release.
Two final ideas I have are: (1) Let the user pass the number of steps to take, defaulting to 1, so model -hess_step 5
would run 5 of the above steps, configured to minimize the overhead of the SD calculations, and exiting early if the maxgrad gets below a threshold like 1E-12. (2) When a step fails, print which parameters ended with a larger gradient, which could help diagnose issues.
Sounds like a great option to be able to employ. Your two final ideas sound like very worthwhile approaches to implementation. Might as well make the threshold a user option, as well (with default of 1E-12, perhaps).
@wStockhausen Good idea. I went ahead and added this capability so the user calls it with
model -hess_step N -hess_step_tol eps
where N=1 and eps=1e-12 are the defaults. I also added a feature to print which parameters are worse off, which may help to identify issues. The code avoids doing the sd_report calculations until the very last step, which should minimize computation time.
Here's how it behaves on a few models:
The catage example model
Experimental feature to take up to 10 Newton step(s) using the inverse Hessian
Initial max gradient=5.3001e-05 and min gradient= 3.00353e-07
Step 1: Updated max gradient=2.19925e-11 and min gradient= 7.67905e-15
... [truncated]
Step 2: Max gradient 0 below threshold of 1e-12 so exiting early
... [truncated]
The 2 Hessian step(s) reduced maxgrad from 5.3001e-05 to 0 which is inf times smaller so kept it
A stock assessment where it works great. Output cleaned up for clarity.
Experimental feature to take up to 10 Newton step(s) using the inverse Hessian
Initial max gradient=4.34726e-05 and min gradient= 1.12029e-07
Step 1: Updated max gradient=1.73194e-07 and min gradient= 6.81857e-13
Step 2: Max gradient 0 below threshold of 1e-12 so exiting early
The 2 Hessian step(s) reduced maxgrad from 4.34726e-05 to 0 which is inf times smaller so kept it
A stock assessment where it fails:
Experimental feature to take up to 10 Newton step(s) using the inverse Hessian
Initial max gradient=0.00681323 and min gradient= 1.48014e-08
Step 1: Worse gradient so abandoning. Consider reoptimizing model to reset files
Par 2: original grad=8.5597e-06 and updated grad= -0.000144485
Par 3: original grad=3.53257e-06 and updated grad= -7.67114e-05
Par 4: original grad=2.39255e-05 and updated grad= 0.000635574
... [truncated]
This model has a parameter (natural mortality; first parameter) stuck on a bound with tiny SE. I fixed it at the MLE and reran and now the hess_step works:
The 2 Hessian step(s) reduced maxgrad from 1.78413e-05 to 0 which is inf times smaller so kept it
I thought maybe that troublesome parameter would be the one with the worse gradient, but it wasn't. In any case I think this is a nice example of how the feature is useful: it failed because the likelihood surface was not quadratic (a parameter on a bound), it was fairly easy to see why, and once fixed the hess_step worked.
I'll move forward with a pull request after cleaning up a bit and review by @johnoel . I hope to have this released in 12.3.
@Cole-Monnahan-NOAA, just reviewed the hess_step branch and propose two changes. The change in revision dd83d25 keeps the code readable by avoiding a call to ::computations1 from another function. The final output from hess_step will now be printed before ::computations1 instead of after; if it is better to have the output after, then a return value for hess_step can be developed. There are some parts of the hess_step function that could be improved, but I left them alone since it is your code. Let me know if the changes are acceptable. See branch issue160.
@johnoel I think the call to computations1 is needed. The hess_step feature only needs the inverse Hessian to do the calculation, so after each step I update that inverse Hessian, which is what happens in this line: https://github.com/Cole-Monnahan-NOAA/admb/blob/add_hess_step/src/nh99/hmc_functions.cpp#L180
However, if this was the last iteration then we need to do more than invert the Hessian. We also need to do the SD report calcs and some other stuff that happens internally that I don't follow. The easiest way I know to do this is to optimize the model without letting it take any steps, hence the code
function_minimizer::maxfn=0;
computations1(ad_comm::argc,ad_comm::argv);
This fakes out the optimization, calculates the Hessian, inverts it, and does all the rest of the stuff needed. This is a common technique from the command line too: model -maxfn 0. I tried avoiding this by calling the SD calcs function directly but it failed to work right, so this was my shortcut. I think it works pretty well because the overhead is very small.
Now, why not call it after returning from the function, e.g., here: https://github.com/Cole-Monnahan-NOAA/admb/blob/add_hess_step/src/nh99/modspmin.cpp#L52 ? Because so much gets printed to the console that if I called it after the hess_step function, the final results would be lost. By calling it inside the function I have access to the key variables and can print them like this: https://github.com/Cole-Monnahan-NOAA/admb/blob/add_hess_step/src/nh99/hmc_functions.cpp#L192.
I suppose one option would be to make the variables maxgrad and maxgrad2 arguments passed by reference and define/update them in modspmin.cpp. Then basically the following lines could be moved from the function to modspmin.cpp, after where hess_step gets called:
// Finished successfully so run the optimizer without
// taking any steps to produce updated output files
// with this new MLE value.
function_minimizer::maxfn=0;
computations1(ad_comm::argc,ad_comm::argv);
cout << "The " << Nstep << " Hessian step(s) reduced maxgrad from " <<
maxgrad0 << " to " << maxgrad2 << " which is " <<
maxgrad0/maxgrad2 << " times smaller so kept it" << endl;
Okay @Cole-Monnahan-NOAA, reverted back to the previous call to ::computations1 from hess_step. Please review c2306dd. If you agree, I will merge issue160 branch that was initially copied from hess_step.
That looks fine to me Johnoel. I wasn't able to test whether my documentation of hess_step was done right, so please check that also.
Okay @Cole-Monnahan-NOAA check the minor edits. If they are okay, I will merge issue160 branch then close pull #169.
Merge away. Thanks!
@Cole-Monnahan-NOAA, can you check revision 94defcc if that was the output you needed?
Merged in Pull #182
Often the optimizer gets to a maxgrad of 1E-05 and quits because that's close enough, but even if run for longer it cannot appreciably decrease this maxgrad. I think in many cases it would be interesting or advisable to get closer to the mode, to either improve uncertainty estimates (a better Hessian) or to confirm convergence of the model.
One approach used in other contexts is to take Newton steps using the information in the Hessian. Given the MLE found by the optimizer, you compute MLE* = MLE - (inverse Hessian)(gradient), where the Hessian and gradient are calculated at that spot. In theory this value MLE* is much closer to the mode, but it comes at a high cost due to the extra matrix calculations.
I prototyped this in my fork here: https://github.com/Cole-Monnahan-NOAA/admb/tree/add_hess_step . The way it works is the user runs the model like usual, then after optimization runs
model -hess_step
which then does this calculation internally and prints information to the screen. If the step results in a worse maxgrad then it abandons the update. I ran this on the packaged models, and all of them showed great improvement in the final gradients, as follows.
finance: The Hessian step resulted in a maxgrad 35593.3 times smaller so kept it
catage: The Hessian step resulted in a maxgrad 2.40996e+06 times smaller so kept it
buscycle: The Hessian step resulted in a maxgrad 1957.05 times smaller so kept it
chem_eng: The Hessian step resulted in a maxgrad 7041.82 times smaller so kept it
forest: The Hessian step resulted in a maxgrad 1.06964e+07 times smaller so kept it
pella_t: The Hessian step resulted in a maxgrad 6478.17 times smaller so kept it
robreg: The Hessian step resulted in a maxgrad 274442 times smaller so kept it
truncreg: The Hessian step resulted in a maxgrad 2.43623e+06 times smaller so kept it
vol: The Hessian step resulted in a maxgrad 195764 times smaller so kept it
All of these got to 1E-08 or less with a single step. I also tested it on an SS model, in this case v3.30.15 for my BSAI flathead sole assessment. Initial maxgrad is about 1E-05.
The Hessian step resulted in a maxgrad 251.006 times smaller so kept it
Run again: The Hessian step resulted in a maxgrad 2.37764 times smaller so kept it
And again: The Hessian step resulted in a maxgrad 10.9362 times smaller so kept it
And again: The Hessian step resulted in a maxgrad 5.15986 times smaller so kept it
Now it's down to about 1E-09 after 4-5 steps (just rerunning the same command over and over). This is much slower convergence, but I think it shows that if I wrapped a loop around this we could enter something like
-hess_step N
and do N steps. It takes about a minute per step on my laptop (this is a CAAL model, so slow). These tests show that this approach does indeed work: not a single time has the gradient gotten worse. Also, it works for a variety of models, including those with phased estimation and inactive parameters. So I think this would be a useful addition to the ADMB source code, and I would like to see it as an option in future releases. The question is how to optimally integrate it into the source code and improve it so it is robust across all models.
Some things to consider are: modifying the computations1 function so that this step happens during the standard optimization, instead of as a distinct step, which would reduce the overhead of the SD calcs and such. The way I've coded it does not lend itself to being used during optimization, only at the end. That could be an option as well, but is beyond the scope of this issue.
I'm curious to see what people think of this idea. Is it a useful addition? Is there a better way to do this? How to best implement it?