NOAA-FIMS / FIMS-planning

:fish: :chart_with_upwards_trend: :ocean: a place to manage FIMS' planning process

investigation group idea: how to validate new features #24

Open · ChristineStawitz-NOAA opened this issue 2 years ago

ChristineStawitz-NOAA commented 2 years ago

This came up in discussions of the software design spec (#20): we want to avoid bloat, so we should only add features that follow a documented best practice. We need some principles for what constitutes a best practice. Some ideas:

  1. A research study (ideally more than one) has shown this is the most accurate approach.
  2. The feature is implemented in a tactical stock assessment modeling platform and is used by more than one real-world assessment.
  3. This feature follows sound statistical principles.
Rick-Methot-NOAA commented 2 years ago

It is good to see attention to this issue. I think that the criteria as described so far may not be strong enough to prevent bloat. (3) is quite weak. (1) is best, especially if it is explicitly designed to compare to other plausible methods.

Another consideration is the formal NOAA R2O transition track per: https://www.noaa.gov/organization/administration/nao-216-105b-policy-on-research-and-development-transitions

Andrea-Havron-NOAA commented 2 years ago

I think we can define specific statistical criteria that need to be met to make (3) more robust (e.g., a new feature needs to show that all its parameters are estimable and identifiable with minimal bias).
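
To make that concrete, here is a minimal sketch of a simulate-refit check for bias and identifiability, written in Python with a toy logistic selectivity curve standing in for a real feature; none of the names below are FIMS code.

```python
# Hypothetical sketch (not FIMS code): a feature-level estimability check.
# Simulate data from known parameter values, refit by maximum likelihood,
# and verify low bias plus a well-conditioned (approximate) Hessian.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
ages = np.arange(1, 21)
true = np.array([5.0, 1.5])  # a50 and slope of a logistic selectivity curve

def selex(par, a):
    a50, slope = par
    return 1.0 / (1.0 + np.exp(-(a - a50) / slope))

def nll(par, a, y, sigma=0.1):
    # Gaussian observation error on the selectivity values, for illustration only
    resid = y - selex(par, a)
    return 0.5 * np.sum((resid / sigma) ** 2)

n_rep, estimates = 200, []
for _ in range(n_rep):
    y = selex(true, ages) + rng.normal(0.0, 0.1, ages.size)
    fit = minimize(nll, x0=[3.0, 1.0], args=(ages, y), method="BFGS")
    # Rough identifiability proxy: the (approximate) inverse Hessian should not
    # be close to singular; a huge condition number flags confounded parameters.
    if fit.success and np.linalg.cond(fit.hess_inv) < 1e6:
        estimates.append(fit.x)

rel_bias = np.mean(np.array(estimates) - true, axis=0) / true
print(f"converged: {len(estimates)}/{n_rep}, relative bias: {rel_bias}")
assert len(estimates) >= 0.9 * n_rep and np.all(np.abs(rel_bias) < 0.05)
```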

timjmiller commented 2 years ago

I'm not so sure it is this easy. Estimability will likely depend on the configuration of other parts of the model unrelated to the feature.


Cole-Monnahan-NOAA commented 2 years ago

@timjmiller but then wouldn't that mean the feature is validated, if it only breaks when unrelated parts of the model break?

As a concrete example, I tried doing self-testing on the double normal selex in SS using Simulation Based Calibration (https://arxiv.org/abs/1804.06788) and it failed really badly. This means that the parameterization is fundamentally incompatible with integration, for at least some configurations. I suspect this would be true for MLEs too: there would be large biases in parameters. So, presuming it fails the validation criteria, would we not include it in FIMS? I see a big gray area there. For me, the validation steps are there to test for bugs in the code, which is a slightly different issue.
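
For readers unfamiliar with the method, here is a stripped-down sketch of the SBC rank check from that paper, applied to a toy normal-normal model where the posterior is available in closed form; a real check would draw posterior samples from the assessment model via MCMC, and everything below is illustrative rather than SS or FIMS code.

```python
# Illustrative SBC rank check (Talts et al. 2018) on a toy normal-normal model,
# where the posterior is known exactly. If the model/sampler is self-consistent,
# the rank of the true parameter among posterior draws is uniform.
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(2)
mu0, tau0, sigma, n_obs = 0.0, 1.0, 1.0, 10   # prior mean/sd, data sd, sample size
n_rep, n_draws = 1000, 19                     # replicates and posterior draws each

ranks = np.empty(n_rep, dtype=int)
for r in range(n_rep):
    theta = rng.normal(mu0, tau0)                       # draw parameter from the prior
    y = rng.normal(theta, sigma, n_obs)                 # simulate data given that parameter
    prec = 1 / tau0**2 + n_obs / sigma**2               # conjugate posterior precision
    post_mean = (mu0 / tau0**2 + y.sum() / sigma**2) / prec
    draws = rng.normal(post_mean, np.sqrt(1 / prec), n_draws)
    ranks[r] = int(np.sum(draws < theta))               # rank of truth among the draws

counts = np.bincount(ranks, minlength=n_draws + 1)      # should be uniform on 0..n_draws
stat, pval = chisquare(counts)
print(f"SBC chi-square p-value: {pval:.3f}")            # small p-values flag miscalibration
```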

timjmiller commented 2 years ago

Yeah, it might work for some model configurations and not others, but the tests are not going to include the full set of possible configurations. I think if it works in some configurations of interest, then that would be sufficient?


Cole-Monnahan-NOAA commented 2 years ago

@timjmiller yeah, that's what I'm thinking too: that proves it is coded right. Presumably we can set up some data-saturated situations where we've hit asymptotic land, and everything should work there too.
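
A rough sketch of what such a data-saturated check could look like, using a toy lognormal catch-per-unit-effort model as a stand-in (illustrative only, not FIMS code): as the sample size grows, the MLE should sit essentially on top of the true values.

```python
# Illustrative "asymptotic land" check: with enough data the MLE of a toy
# lognormal index model should essentially recover the true values.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
true_q, true_sigma = 0.2, 0.3

def nll(par, index, biomass):
    log_q, log_sigma = par
    resid = np.log(index) - (log_q + np.log(biomass))
    sigma = np.exp(log_sigma)
    return np.sum(0.5 * (resid / sigma) ** 2 + np.log(sigma))

for n in (50, 500, 50_000):                        # increasingly data-rich situations
    biomass = rng.uniform(100, 1000, n)
    index = true_q * biomass * np.exp(rng.normal(0, true_sigma, n))
    fit = minimize(nll, x0=[0.0, 0.0], args=(index, biomass), method="BFGS")
    q_hat, sigma_hat = np.exp(fit.x)
    print(f"n={n:6d}  q_hat={q_hat:.4f}  sigma_hat={sigma_hat:.4f}")

# A feature test could assert, e.g., |q_hat - true_q| / true_q < 0.01 at the largest n.
```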

Andrea-Havron-NOAA commented 2 years ago

@timjmiller and @Cole-Monnahan-NOAA, I agree. As the complexity of a model grows, it becomes challenging to demonstrate parameter estimability under all possible model configurations, and I don't think that bar should be a requirement for a feature to be included in FIMS. At a bare minimum, however, a new feature should demonstrate feature-specific parameter estimability under a suite of examples in which the feature is most likely to be used. Ideally, it would also be helpful to document the cases where estimability fails.
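
One way to encode "a suite of examples in which the feature is most likely to be used" is a small test matrix that runs the same estimability check across configurations and records which ones fail, so failures are documented rather than hidden. In the sketch below, the configuration names and the check_estimability() routine are placeholders, not FIMS API.

```python
# Hypothetical sketch: run one estimability check over a suite of plausible
# configurations and record (rather than hide) any that fail.
from itertools import product

def check_estimability(recruitment, selectivity, data_richness):
    """Placeholder for a simulate-refit-check routine like the one sketched
    earlier; the hard-coded failure below is an arbitrary stand-in so this runs."""
    return not (selectivity == "double_normal" and data_richness == "index_only")

suite = product(["beverton_holt", "ricker"],            # recruitment
                ["logistic", "double_normal"],          # selectivity
                ["index_only", "index_plus_comps"])     # data richness

failures = [config for config in suite if not check_estimability(*config)]
print("Configurations to document as not estimable:", failures)
```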

ChristineStawitz-NOAA commented 2 years ago

We might be able to leverage the rOpenSci standards for statistical software - in particular, the time series, spatial, and Bayesian sections.

Cole-Monnahan-NOAA commented 2 years ago

Those are new to me but definitely worth investigating and considering.

Rick-Methot-NOAA commented 2 years ago

I like the idea of having, and archiving, feature-specific tests. Making those tests demonstrate performance is harder.

Here's a war story about how subtle a difference can be: the hake assessment team noticed that a new SS3 version gave different correlations in the MCMC chain (ask Kelli for details), but the new code, run on the converged parameters from the old code, gave identical derived quantities. We traced it to the new version making 3 more iterations (out of ~450) before stopping; the differences in parameters and derived quantities were at the 1e-05 level and passed all of our filters for testing new versions. Eventually it came down to a commit that reordered some operations. Our assumption is that this affected the gradients in a subtle way, which then slightly changed the path to convergence. Our conclusion was to move on with the new code.
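
For what it's worth, that kind of version-to-version comparison amounts to a regression test with a numeric tolerance, and the story shows that the tolerance is itself part of the test design. A hedged sketch of such a check (synthetic numbers and illustrative column names, not an SS3 or FIMS tool):

```python
# Illustrative regression check comparing derived quantities between two model
# versions; the data here are synthetic stand-ins for real model output.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
years = np.arange(2000, 2021)
old = pd.DataFrame({"ssb": rng.uniform(1e4, 5e4, years.size),
                    "recruits": rng.uniform(1e5, 1e6, years.size)}, index=years)
# Pretend the new version differs only at the ~1e-5 relative level.
new = old * (1 + rng.normal(0, 1e-5, old.shape))

rel_tol = 1e-4
worst = ((new - old).abs() / old.abs()).max()
for col, w in worst.items():
    print(f"{col:10s} max relative diff = {w:.2e}  [{'OK' if w < rel_tol else 'CHECK'}]")
# As in the hake example, differences at the 1e-5 level slip under a 1e-4 filter,
# so the threshold itself deserves as much scrutiny as the comparison.
```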