NOAA-FIMS / FIMS

The repository for development of FIMS
https://noaa-fims.github.io/FIMS/
GNU General Public License v3.0

[Developer Issue]: Investigate the requirements for FIMS reproducibility #649

Open Bai-Li-NOAA opened 1 month ago

Bai-Li-NOAA commented 1 month ago

Description

A mismatch between FIMS results from local runs and GHA was identified, even when using the same FIMS model, data, and seed (see notes here). The mismatch was due to different versions of tools used locally and on GHA. To ensure FIMS results are reproducible, we need, as Stan does, to ensure the following components are identical: FIMS version, R interface, included library versions, operating system version, computer hardware, and C++ compiler.
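One minimal sketch of recording this information alongside each set of saved results (the file name and the set of packages queried are illustrative assumptions, not an agreed format):

# sketch: record the environment needed to reproduce a FIMS run
# (file name and package set are assumptions for illustration;
# the C++ compiler version is not captured here and would need to be recorded separately)
env_info <- c(
  fims_version = as.character(packageVersion("FIMS")),
  tmb_version  = as.character(packageVersion("TMB")),
  r_version    = R.version.string,
  platform     = R.version$platform,
  os           = paste(Sys.info()[c("sysname", "release")], collapse = " ")
)
writeLines(paste(names(env_info), env_info, sep = ": "), "fims_run_environment.txt")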

Tasks:

iantaylor-NOAA commented 1 month ago

@Bai-Li-NOAA, thanks for catching this and posting the issue.

In addition to documenting the requirements for reproducibility, I think it would be helpful to document the extent of the difference that could be expected when the components aren't identical.

Looking at the expected and actual values from this line of the GHA log (https://github.com/NOAA-FIMS/FIMS/actions/runs/9762956877/job/26947645590#step:7:274) and taking the ratio gives a range of 0.9999947 to 1.0000058. I think that's plenty of precision for any fisheries stock assessment model. SS3 results have always differed to a similar extent among operating systems, and it's never been an issue for the production assessments.

Having said that, I understand the problem this poses for our testing framework and for reproducibility in general.

Perhaps the User Guide could include language like "Differences in R interface, included library versions, operating system version, computer hardware, and C++ compiler may lead to differences in results on the order of 1e-5." Perhaps referencing the Stan page makes sense as well. I don't think we should speak to the differences in results between FIMS versions because those might be more extensive depending on what we're changing.

# code to calculate ratio
expected <- c(974415.508459565, 855922.366697973, 665636.943841043, 559681.933274521, 417469.882961596, 364389.958969985, 313543.539361582, 194952.972377838, 166776.416177042)
actual <- c(974417.319788301, 855922.751294695, 665639.068928898, 559682.547638791, 417471.354292516, 364390.381762609, 313544.580739862, 194951.845447807, 166777.297021839)
range(expected/actual)
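
For the testing framework, one option (a sketch, assuming the tests use testthat) would be to compare with an explicit tolerance of that order rather than exact equality:

# sketch: tolerance-based comparison for cross-platform tests (assumes testthat)
library(testthat)
expect_equal(actual, expected, tolerance = 1e-5)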
k-doering-NOAA commented 1 month ago

Just adding that this issue popped up on occasion when running regression tests for SS3, usually due to OS differences (but it does seem logical that all the other components mentioned could have an effect!). For unstable models, the differences could sometimes be large, because the model run could end up at a different optimum.

kellijohnson-NOAA commented 1 month ago

Are the differences before or after optimization? It might be worth trying to compare before optimization. If the differences are small, it might be the transfer of values from the R interface to C++ cropping decimal places.
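A self-contained sketch of checking the cropped-decimal-places idea, using the first value from the expected vector above: round-trip the value through R's default 7-significant-digit printing and look at the relative error that introduces. Comparing the objective function value at the initial parameters, before calling the optimizer, would similarly separate transfer/precision effects from optimizer path effects.

# sketch: relative error introduced by printing at R's default 7 significant digits
x <- 974415.508459565                  # first value from the expected vector above
x_printed <- as.numeric(format(x))     # format() uses getOption("digits") = 7 by default
abs(x - x_printed) / x                 # relative error from the cropped decimal places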