COSIMA / access-om2

ACCESS-OM2 global ocean - sea ice coupled model configurations.

non-reproducible runs #266

Open aekiss opened 2 years ago

aekiss commented 2 years ago

Moving a private Slack chat here.

@adele157 is re-running a section of my 01deg_jra55v140_iaf_cycle3 experiment with extra tracers on branch 01deg_jra55v140_iaf_cycle3_antarctic_tracers here /home/157/akm157/access-om2/01deg_jra55v140_iaf_cycle3_antarctic_tracers/.

Her re-run matches my original up to run 590, but not for run 591 and later.

Note that @adele157 has 2 sets of commits for runs 587-610 on branch 01deg_jra55v140_iaf_cycle3_antarctic_tracers. Ignore the first set - they had the wrong timestep.

Differences in md5 hashes in manifests/restart.yaml indicate bitwise differences in the restarts. For some reason ocean_barotropic.res.nc md5 hashes never match, but presumably this is harmless if the other restarts match.
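The kind of manifest comparison being done here can be sketched as follows. This assumes a simplified `{path: md5}` mapping; the real payu `manifests/restart.yaml` is YAML with a richer per-file schema (and would be loaded with `yaml.safe_load`), and the example paths and hashes below are hypothetical.

```python
# Sketch of comparing md5 hashes between two payu restart manifests.
# Assumes a simplified {path: md5} mapping; the real manifests/restart.yaml
# has a richer per-file schema and would be loaded with yaml.safe_load.

def compare_md5s(manifest_a, manifest_b):
    """Return the restart paths whose md5 hashes differ between two runs."""
    shared = manifest_a.keys() & manifest_b.keys()
    return sorted(p for p in shared if manifest_a[p] != manifest_b[p])

# Hypothetical hashes for illustration only.
original = {
    "restart590/ocean/ocean_temp_salt.res.nc": "aaa111",
    "restart590/ocean/ocean_barotropic.res.nc": "bbb222",
}
rerun = {
    "restart590/ocean/ocean_temp_salt.res.nc": "aaa111",
    "restart590/ocean/ocean_barotropic.res.nc": "ccc333",  # never matches
}

print(compare_md5s(original, rerun))
# -> ['restart590/ocean/ocean_barotropic.res.nc']
```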

Relevant commits (01deg_jra55v140_iaf_cycle3..01deg_jra55v140_iaf_cycle3_antarctic_tracers) are

So it would seem that something different happened in run 590, so the restarts used by run 591 differ. @adele157 re-ran her 590 and the result was the same as her previous run. So that seems to indicate something strange happened with my run 590 (0ab9c24). I can't see anything suspicious for run 590 in my run summary. There are changes to manifests/input.yaml in my runs 588 and 589, but they don't seem relevant.

aekiss commented 2 years ago

Restarts, executables and core count are the same. We would expect this to be reproducible, right? Is this what's tested here? https://accessdev.nci.org.au/jenkins/job/ACCESS-OM2/job/reproducibility/

aekiss commented 2 years ago

Here's the change in my run 588: https://github.com/COSIMA/01deg_jra55_iaf/commit/0f0ec03dce56a17d5e85958d28ac1ceac20efc13. Run 589 reverses it: https://github.com/COSIMA/01deg_jra55_iaf/commit/9546b24c32437304c7cdadc09d874112c2c500d6, as we can see from this diff between 587 and 589.

aekiss commented 2 years ago

@adele157 you said

The problem emerges during run 590, there are differences after day 20.

What variables were you comparing? Are the differences undetectable (bitwise) on day 19?

adele-morrison commented 2 years ago

I was just looking (pretty coarsely) at daily temperature output. Not sure how to check for bitwise reproducibility from the output, because I think it's had the precision reduced, right?


aekiss commented 2 years ago

Yep, outputs are 32-bit (single precision) whereas internally and in restarts it's double-precision.

I meant, was the largest absolute difference in the single precision outputs exactly zero on day 19, and nonzero on day 20? Or was there a detectable difference earlier in the run?
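This point can be demonstrated with a small stdlib sketch: two double-precision values that differ only in the low-order bits become indistinguishable once rounded to the single precision used in the output files (the particular values are chosen for illustration).

```python
import struct

def to_f32(x):
    """Round a Python float (IEEE-754 double) to single precision and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Two doubles that differ by ~2e-10, the sort of tiny perturbation a
# non-reproducible glitch might introduce.
a = 1.000000001
b = 1.0000000012

assert a != b                  # distinguishable at double precision
assert to_f32(a) == to_f32(b)  # identical after rounding to single precision
print("difference is invisible in single-precision output")
```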

adele-morrison commented 2 years ago

I did a more thorough check: There are no differences in the daily averaged output of temperature for the first two days. The difference emerges on day 3 (3rd July 1983) and is present thereafter.


aekiss commented 2 years ago

can you show a plot of the difference on day 3?

adele-morrison commented 2 years ago

There is only regional output for the new run, so this is the whole domain we have to compare. This is the top ocean level, difference on day 3.

[Screenshot: day-3 temperature difference at the top ocean level]

aidanheerdegen commented 2 years ago

It is odd.

So if the differences emerge by day 3 of run 590, then the cause must be in the restarts from run 589, and yet there is no difference except in the barotropic files.

Possibilities include:

  1. differences in something not captured in the manifests
  2. actual differences in the barotropic restarts, but they're masked by there always being differences in the md5 sums
  3. a weird random glitch in your run, Andrew, that hasn't affected Adele's

We can't do much about 1.

For 2 I'd be looking at the differences between the barotropic restart files in /scratch/v45/akm157/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers.

So check that the differences between restart589/ocean/ocean_barotropic.res.nc and restart588/ocean/ocean_barotropic.res.nc are broadly consistent with the differences between restart588/ocean/ocean_barotropic.res.nc and restart587/ocean/ocean_barotropic.res.nc, i.e. no really weird signal or corruption.
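That sanity check could be sketched as below. The arrays are synthetic stand-ins for a restart field (in practice the data would be read from ocean_barotropic.res.nc with netCDF4 or xarray), so the names and magnitudes are illustrative only.

```python
import numpy as np

# Synthetic stand-ins for a barotropic restart field across three restarts:
# each successive restart is the previous one plus ordinary model evolution.
rng = np.random.default_rng(0)
eta_587 = rng.normal(0.0, 0.5, size=(100, 100))
eta_588 = eta_587 + rng.normal(0.0, 1e-3, size=(100, 100))
eta_589 = eta_588 + rng.normal(0.0, 1e-3, size=(100, 100))

def max_abs_diff(a, b):
    """Largest pointwise difference between two restart fields."""
    return float(np.max(np.abs(a - b)))

d1 = max_abs_diff(eta_588, eta_587)
d2 = max_abs_diff(eta_589, eta_588)

# Successive differences should be finite and of comparable magnitude;
# a corrupted restart would show a wildly different (or NaN) signal.
print(d1, d2)
assert np.isfinite(d1) and np.isfinite(d2)
assert 0.1 < d2 / d1 < 10
```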

You don't have the specific restarts any longer, but you could check out what is available in /g/data/ik11/restarts/access-om2-01/01deg_jra55v140_iaf_cycle3.

So restart587/ocean/ocean_barotropic.res.nc is there. You could check it is consistent with Adele's restart587.

As for 3, well you could try re-running your simulation from your restart587 and see if you can reproduce your own run.

russfiedler commented 2 years ago

It's in the ice region. Have the ice restarts been checked too?

aidanheerdegen commented 2 years ago

They're covered by the manifests, and don't show differences AFAICT.

aekiss commented 1 year ago

I've done a reproducibility test starting from /g/data/ik11/restarts/access-om2-01/01deg_jra55v140_iaf_cycle3/restart587 here: https://github.com/COSIMA/01deg_jra55_iaf/tree/01deg_jra55v140_iaf_cycle3_repro_test /home/156/aek156/payu/01deg_jra55v140_iaf_cycle3_repro_test

Comparing repro test to my original run (01deg_jra55v140_iaf_cycle3..01deg_jra55v140_iaf_cycle3_repro_test) we get

Comparing Adele's run to repro test (01deg_jra55v140_iaf_cycle3_repro_test..01deg_jra55v140_iaf_cycle3_antarctic_tracers) we get

So I think we can conclude

  1. runs are normally reproducible, including barotropic restarts (not sure how Adele's barotropic restarts got altered, but it has no impact on the rest of the model)
  2. something went wrong in run 590 of 01deg_jra55v140_iaf_cycle3, affecting the rest of cycle 3, cycle 4, and the extension to cycle 4.

aekiss commented 1 year ago

Maybe there are clues as to what went wrong in the log files in /g/data/cj50/access-om2/raw-output/access-om2-01/01deg_jra55v140_iaf_cycle3/output590.

aekiss commented 1 year ago

Maybe the env.yaml files differ?

aekiss commented 1 year ago

Marshall's comments

aekiss commented 1 year ago

Paul L's comment:

Could be relevant to glibc inconsistencies in transcendental functions: https://stackoverflow.com/questions/71294653/floating-point-inconsistencies-after-upgrading-libc-libm
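One way to check whether a libm/glibc change is in play is to fingerprint the exact bit patterns of a few transcendental function results on each system; if the hex patterns differ between environments, the math library becomes a suspect. A minimal stdlib sketch (the test values are arbitrary):

```python
import math
import struct

def f64_bits(x):
    """Hex of the IEEE-754 double bit pattern, for exact cross-system comparison."""
    return struct.pack(">d", x).hex()

# Fingerprint a few transcendental results; run this on both systems and
# diff the output. Any mismatch implicates the math library (e.g. libm).
for fn in (math.exp, math.log, math.sin):
    for x in (0.1, 1.0, 123.456):
        print(fn.__name__, x, f64_bits(fn(x)))
```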

aekiss commented 1 year ago

```
diff /g/data/cj50/access-om2/raw-output/access-om2-01/01deg_jra55v140_iaf_cycle3/output589/env.yaml /g/data/cj50/access-om2/raw-output/access-om2-01/01deg_jra55v140_iaf_cycle3/output590/env.yaml
```

shows nothing suspicious, but env.yaml doesn't capture everything

access-hive-bot commented 1 year ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/how-do-i-start-a-new-perturbation-experiment/262/5

access-hive-bot commented 11 months ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/inconsistent-ocean-and-sea-ice-in-final-7-5yr-of-0-1-iaf-cycle-4/1492/1

aekiss commented 11 months ago

We have a second example of non-reproducibility: the re-run 01deg_jra55v140_iaf_cycle4_rerun_from_2002 was identical to the original 01deg_jra55v140_iaf_cycle4 for about half the run (from April 2002 until 2011-07-01) but then differs from run 962 onwards.

In both the original and re-run, run 962 was part of a continuous sequence of runs with no crashes, Gadi shutdown, or manual intervention such as queue changes, timestep changes, or core count changes.

Ideas for possible causes:

  1. gadi system change in original run - ruled out because re-run was identical to original for about 10 years
  2. gadi system change in re-run - ruled out by test below
  3. random glitch in original run
  4. random glitch in re-run - ruled out by test below

We can distinguish 2, 3 and 4 by re-running run 961 (starting from restart960, 2011-04-01).

The closest available restart is

/g/data/ik11/restarts/access-om2-01/01deg_jra55v140_iaf_cycle4/restart959

so we would need to run from 2011-01-01 (run 960) anyway.

access-hive-bot commented 11 months ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/inconsistent-ocean-and-sea-ice-in-final-7-5yr-of-0-1-iaf-cycle-4/1492/5

aekiss commented 11 months ago

I've done a reproducibility test 01deg_jra55v140_iaf_cycle4_repro_test in /home/156/aek156/payu/01deg_jra55v140_iaf_cycle4_repro_test, starting from /g/data/ik11/restarts/access-om2-01/01deg_jra55v140_iaf_cycle4/restart959.

This reproduces both original and rerun initial condition md5 hashes (other than for ocean_barotropic.res.nc.*) in manifests/restart.yaml for runs 960, 961, 962, ruling out a gadi system change in re-run (option 2 above).

For run 963 the manifests/restart.yaml initial condition md5 hashes from 01deg_jra55v140_iaf_cycle4_repro_test match the rerun (01deg_jra55v140_iaf_cycle4_rerun_from_2002), but not the original run (01deg_jra55v140_iaf_cycle4).

Therefore 01deg_jra55v140_iaf_cycle4 had a non-reproducible glitch in run 962.

This is unfortunate: it means we can't regenerate sea ice data to match the ocean state in 01deg_jra55v140_iaf_cycle4 from 2011-07-01 onwards. 01deg_jra55v140_iaf_cycle4_rerun_from_2002 didn't save any ocean data, so if we want ocean data consistent with the ice data we'll have to re-run it and find somewhere to store the output (about 6 TB).

It also means there's a known flaw in 01deg_jra55v140_iaf_cycle4 from 2011-07-01 onwards (and in the follow-on run 01deg_jra55v140_iaf_cycle4_jra55v150_extension). However, I expect (although haven't checked) that the initial glitch was a very small perturbation (e.g. an incorrect value in one variable in one grid cell at one timestep), in which case the ocean data we have would still be credible: a different sample from the same statistical distribution in this turbulent flow. We should probably retain this data despite the flaw, as it has been used in publications. This is analogous to the glitch in 01deg_jra55v140_iaf_cycle3 (see above), which affected all subsequent runs.
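The deduction in the preceding comments amounts to finding the first run whose restart hashes diverge between a pair of experiments. A toy sketch of that logic, with hypothetical per-run hashes:

```python
def first_divergence(hashes_a, hashes_b):
    """Return the first run number whose restart hash differs, or None."""
    for run in sorted(hashes_a.keys() & hashes_b.keys()):
        if hashes_a[run] != hashes_b[run]:
            return run
    return None

# Hypothetical per-run initial-condition hashes for illustration.
original = {960: "h1", 961: "h2", 962: "h3", 963: "GLITCH"}
rerun    = {960: "h1", 961: "h2", 962: "h3", 963: "h4"}
repro    = {960: "h1", 961: "h2", 962: "h3", 963: "h4"}

assert first_divergence(repro, rerun) is None  # repro test matches the rerun
print(first_divergence(repro, original))
# -> 963: run 963's initial condition differs, so the glitch occurred in run 962
```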

aekiss commented 11 months ago

sea_level first loses reproducibility on 2011-09-27, simultaneously in both the Arctic and Antarctic (suggesting a CICE or sea ice coupling error). Differences then spread across the globe over the next few days, suggesting a barotropic signal (there may also be some dependence on topography: the Mid-Atlantic Ridge apparently shows up on 30 September, although not on other days). Plot script: https://github.com/aekiss/notebooks/blob/master/01deg_jra55v140_iaf_cycle4_repro_test.ipynb

[Figure: daily sea_level difference maps around 2011-09-27]
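The underlying check behind these plots (find the first day on which a daily field from the two runs differs at all) can be sketched with synthetic data; the arrays below stand in for the daily sea_level output the notebook reads from the two runs.

```python
import numpy as np

def first_diverging_day(field_a, field_b):
    """Index of the first day on which two (time, y, x) fields differ, or None."""
    for t in range(field_a.shape[0]):
        if not np.array_equal(field_a[t], field_b[t]):
            return t
    return None

# Synthetic stand-ins for daily sea_level from the two runs.
days, ny, nx = 10, 4, 8
base = np.zeros((days, ny, nx))
perturbed = base.copy()
perturbed[6, 0, 0] = 1e-12   # tiny polar anomaly appearing on day 6
perturbed[7:] += 1e-6        # then spreading

print(first_diverging_day(base, perturbed))
# -> 6
```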

access-hive-bot commented 11 months ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/inconsistent-ocean-and-sea-ice-in-final-7-5yr-of-0-1-iaf-cycle-4/1492/8

aekiss commented 11 months ago

The SST difference also starts as tiny anomalies in both polar regions on 2011-09-27 and then rapidly becomes global. Plot script: https://github.com/aekiss/notebooks/blob/master/01deg_jra55v140_iaf_cycle4_repro_test.ipynb

[Figure: daily SST difference maps around 2011-09-27]

aekiss commented 11 months ago

Note that these plots use single-precision output data, so may be unable to detect the very earliest anomalies in the calculation, which uses double precision.

access-hive-bot commented 11 months ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/inconsistent-ocean-and-sea-ice-in-final-7-5yr-of-0-1-iaf-cycle-4/1492/12

access-hive-bot commented 10 months ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/inconsistent-ocean-and-sea-ice-in-final-7-5yr-of-0-1-iaf-cycle-4/1492/16