aekiss opened this issue 2 years ago
Restarts, executables and core count are the same. We would expect this to be reproducible, right? Is this what's tested here? https://accessdev.nci.org.au/jenkins/job/ACCESS-OM2/job/reproducibility/
Here's the change in my run 588: https://github.com/COSIMA/01deg_jra55_iaf/commit/0f0ec03dce56a17d5e85958d28ac1ceac20efc13 and run 589 reverses it: https://github.com/COSIMA/01deg_jra55_iaf/commit/9546b24c32437304c7cdadc09d874112c2c500d6 as we can see from this diff between 587 and 589.
@adele157 you said
The problem emerges during run 590, there are differences after day 20.
What variables were you comparing? Are the differences undetectable (bitwise) on day 19?
I was just looking (pretty coarsely) at daily temperature output. Not sure how to check for bitwise reproducibility from the output, because I think it's had the precision reduced right?
Yep, outputs are 32-bit (single precision) whereas internally and in restarts it's double-precision.
I meant, was the largest absolute difference in the single precision outputs exactly zero on day 19, and nonzero on day 20? Or was there a detectable difference earlier in the run?
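For anyone wanting to do this check directly, here's a minimal sketch (assuming xarray/numpy are available; the file paths and the temp variable name are hypothetical) that reports the largest absolute difference between the same variable in two single-precision daily output files - bitwise-identical output gives exactly 0.0:

```python
# Minimal sketch: max abs difference between the same variable in two output files.
import numpy as np
import xarray as xr

a = xr.open_dataset("run_original/ocean_daily.nc")["temp"]   # hypothetical paths/variable name
b = xr.open_dataset("run_rerun/ocean_daily.nc")["temp"]

diff = np.abs(a - b)
print(float(diff.max()))   # exactly 0.0 if the single-precision outputs are bitwise identical
```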
I did a more thorough check: There are no differences in the daily averaged output of temperature for the first two days. The difference emerges on day 3 (3rd July 1983) and is present thereafter.
can you show a plot of the difference on day 3?
There is only regional output for the new run, so this is the whole domain we have to compare. This is the top ocean level, difference on day 3.
It is odd.
So if the differences emerge by day 3 of run 590, then it must be in the restarts from run 589 and yet there is no difference except in the barotropic files.
Possibilities include:
We can't do much about 1.
For 2 I'd be looking at the differences between the barotropic restart files in /scratch/v45/akm157/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers. So check that the differences between restart589/ocean/ocean_barotropic.res.nc and restart588/ocean/ocean_barotropic.res.nc are broadly consistent with the differences between restart588/ocean/ocean_barotropic.res.nc and restart587/ocean/ocean_barotropic.res.nc - no really weird signal/corruption.
You don't have the specific restarts any longer, but you could check out what is available in /g/data/ik11/restarts/access-om2-01/01deg_jra55v140_iaf_cycle3. So restart587/ocean/ocean_barotropic.res.nc is there. You could check it is consistent with Adele's restart587.
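To compare those restart files, here's a minimal sketch (assuming xarray/numpy, and that the two files share variable names and shapes) that prints the field-by-field maximum absolute difference; corruption should stand out as a wildly different magnitude in one or more fields. For tiled restarts (e.g. ocean_barotropic.res.nc.NNNN) you'd loop over the tiles.

```python
# Minimal sketch: field-by-field max abs difference between two barotropic restarts.
# Paths are examples based on the archive layout above; adjust as needed.
import numpy as np
import xarray as xr

a = xr.open_dataset("restart588/ocean/ocean_barotropic.res.nc")
b = xr.open_dataset("restart589/ocean/ocean_barotropic.res.nc")

for name in a.data_vars:
    if name in b.data_vars:
        d = np.abs(a[name].values - b[name].values)
        print(f"{name}: max abs diff = {np.nanmax(d):.3e}")
```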
As for 3, well you could try re-running your simulation from your restart587 and see if you can reproduce your own run.
It's in the ice region. Have the ice restarts been checked too?
They're covered by the manifests, and don't show differences AFAICT
I've done a reproducibility test starting from /g/data/ik11/restarts/access-om2-01/01deg_jra55v140_iaf_cycle3/restart587, here:
https://github.com/COSIMA/01deg_jra55_iaf/tree/01deg_jra55v140_iaf_cycle3_repro_test
/home/156/aek156/payu/01deg_jra55v140_iaf_cycle3_repro_test
Comparing the repro test to my original run (01deg_jra55v140_iaf_cycle3..01deg_jra55v140_iaf_cycle3_repro_test) we get:
git diff -U0 0ab9c24..5072784 manifests/restart.yaml | grep -B 5 md5 | less : no md5 differences (even in barotropic restarts)
git diff -U0 d09ec5e..eed5041 manifests/restart.yaml | grep -B 5 md5 | less : md5 differences in lots of ocean and ice restarts (presumably all of them?) - so I can't reproduce the restarts from 01deg_jra55v140_iaf_cycle3 run 590
Comparing Adele's run to the repro test (01deg_jra55v140_iaf_cycle3_repro_test..01deg_jra55v140_iaf_cycle3_antarctic_tracers) we get:
git diff -U0 5072784..dcffbd6 manifests/restart.yaml | grep -B 5 md5 | less : md5 differences only in barotropic and tracer-related ocean restarts (as expected)
git diff -U0 eed5041..942fb38 manifests/restart.yaml | grep -B 5 md5 | less : md5 differences only in barotropic and tracer-related ocean restarts - so the non-barotropic restarts from Adele's run 590 are reproducible
So I think we can conclude that something non-reproducible happened in run 590 of 01deg_jra55v140_iaf_cycle3, affecting the rest of cycle 3, cycle 4, and the extension to cycle 4. Maybe there are clues as to what went wrong in the log files in /g/data/cj50/access-om2/raw-output/access-om2-01/01deg_jra55v140_iaf_cycle3/output590.
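As a cross-check on the grep approach, here's a minimal sketch (assuming PyYAML and a payu/yamanifest-style manifests/restart.yaml layout where each file entry carries a hashes: md5 field; the checkout paths are hypothetical) that lists exactly which restart files have differing md5 hashes between two manifests:

```python
# Minimal sketch: report restart files whose md5 hashes differ between two
# manifests/restart.yaml files (payu/yamanifest-style layout assumed).
import yaml

def md5s(path):
    entries = {}
    with open(path) as f:
        for doc in yaml.safe_load_all(f):  # tolerate an optional header document
            if isinstance(doc, dict):
                entries.update(doc)
    return {k: v["hashes"].get("md5")
            for k, v in entries.items()
            if isinstance(v, dict) and "hashes" in v}

a = md5s("cycle3_original/manifests/restart.yaml")    # hypothetical checkout paths
b = md5s("cycle3_repro_test/manifests/restart.yaml")

for name in sorted(set(a) & set(b)):
    if a[name] != b[name]:
        print("md5 differs:", name)
```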
Maybe the env.yaml files differ?
Marshall's comments:
Paul L's comment:
Could be relevant to glibc inconsistencies in transcendental functions: https://stackoverflow.com/questions/71294653/floating-point-inconsistencies-after-upgrading-libc-libm
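This isn't the model's actual code path, but if a glibc/libm change is suspected, one quick spot-check is to print the exact bit patterns of a few transcendental results under each software environment and diff the output; a minimal sketch of that idea:

```python
# Minimal sketch: print bit patterns of libm-backed transcendentals so runs of this
# script under two different glibc/compiler environments can be diffed directly.
import math
import struct

for x in (0.1, 1.0, 1e-7, 700.0):
    for fn in (math.exp, math.log1p, math.sin):
        y = fn(x)
        bits = struct.pack(">d", y).hex()   # exact 64-bit representation
        print(f"{fn.__name__}({x}) = {y!r}  bits = {bits}")
```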
diff /g/data/cj50/access-om2/raw-output/access-om2-01/01deg_jra55v140_iaf_cycle3/output589/env.yaml /g/data/cj50/access-om2/raw-output/access-om2-01/01deg_jra55v140_iaf_cycle3/output590/env.yaml
shows nothing suspicious, but env.yaml doesn't capture everything.
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:
https://forum.access-hive.org.au/t/how-do-i-start-a-new-perturbation-experiment/262/5
We have a second example of non-reproducibility: the re-run 01deg_jra55v140_iaf_cycle4_rerun_from_2002 was identical to the original 01deg_jra55v140_iaf_cycle4 for about half the run (from April 2002 until 2011-07-01) but then differs from run 962 onwards.
In both the original and re-run, run 962 was part of a continuous sequence of runs with no crashes, Gadi shutdown, or manual intervention such as queue changes, timestep changes, or core count changes.
Ideas for possible causes:
We can distinguish 2, 3 and 4 by re-running 961 (starting from restart960, 2011-04-01). If that re-run matches the existing re-run (01deg_jra55v140_iaf_cycle4_rerun_from_2002), there was a glitch in the original run; otherwise it's a glitch in the re-run. The closest available restart is /g/data/ik11/restarts/access-om2-01/01deg_jra55v140_iaf_cycle4/restart959 so we would need to run from 2011-01-01 (run 960) anyway.
I've done a reproducibility test 01deg_jra55v140_iaf_cycle4_repro_test in /home/156/aek156/payu/01deg_jra55v140_iaf_cycle4_repro_test, starting from /g/data/ik11/restarts/access-om2-01/01deg_jra55v140_iaf_cycle4/restart959.
This reproduces both the original and rerun initial condition md5 hashes (other than for ocean_barotropic.res.nc.*) in manifests/restart.yaml for runs 960, 961 and 962, ruling out a Gadi system change in the re-run (option 2 above).
For run 963 the manifests/restart.yaml initial condition md5 hashes from 01deg_jra55v140_iaf_cycle4_repro_test match the rerun (01deg_jra55v140_iaf_cycle4_rerun_from_2002), but not the original run (01deg_jra55v140_iaf_cycle4).
Therefore 01deg_jra55v140_iaf_cycle4 had a non-reproducible glitch in run 962.
This is unfortunate - it means we can't regenerate sea ice data to match the ocean state in 01deg_jra55v140_iaf_cycle4 from 2011-07-01 onwards. 01deg_jra55v140_iaf_cycle4_rerun_from_2002 didn't save any ocean data, so if we want ocean data consistent with the ice data we'll have to re-run this and find somewhere to store it (about 6 TB).
It also means there's a known flaw in 01deg_jra55v140_iaf_cycle4 from 2011-07-01 onwards (and the follow-on run 01deg_jra55v140_iaf_cycle4_jra55v150_extension), but I expect (although haven't checked) that the initial glitch was a very small perturbation (e.g. an incorrect value in one variable in one grid cell at one timestep), in which case the ocean data we have would still be credible (a different sample from the same statistical distribution in this turbulent flow). We should probably retain this data despite the flaw, as it has been used in publications. This is analogous to the glitch in 01deg_jra55v140_iaf_cycle3 (see above), which affected all subsequent runs.
sea_level first loses reproducibility on 2011-09-27, simultaneously in both the Arctic and Antarctic (suggesting a CICE or sea ice coupling error). Differences then spread across the globe over the next few days, suggesting a barotropic signal (there may also be some dependence on topography: the mid-Atlantic ridge apparently shows up on 30 Sept, although not on other days). Plot script: https://github.com/aekiss/notebooks/blob/master/01deg_jra55v140_iaf_cycle4_repro_test.ipynb
The SST difference also starts as tiny anomalies in both polar regions on 2011-09-27 and then rapidly becomes global. Plot script: https://github.com/aekiss/notebooks/blob/master/01deg_jra55v140_iaf_cycle4_repro_test.ipynb
Note that these plots use single-precision output data, so may be unable to detect the very earliest anomalies in the calculation, which uses double precision.
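For reference, this kind of difference map can be produced with something like the following minimal sketch (assuming xarray and matplotlib; the file paths and the surface_temp variable name are hypothetical and should be replaced by the actual output layout):

```python
# Minimal sketch: plot the single-precision SST difference between two runs on one day.
import matplotlib.pyplot as plt
import xarray as xr

orig = xr.open_dataset("original/ocean_daily.nc")["surface_temp"]      # hypothetical paths/variable
repro = xr.open_dataset("repro_test/ocean_daily.nc")["surface_temp"]

(repro - orig).isel(time=0).plot(robust=True)   # robust colour limits highlight small anomalies
plt.title("SST difference: repro test minus original")
plt.show()
```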
Moving a private Slack chat here.
@adele157 is re-running a section of my 01deg_jra55v140_iaf_cycle3 experiment with extra tracers on branch 01deg_jra55v140_iaf_cycle3_antarctic_tracers, here: /home/157/akm157/access-om2/01deg_jra55v140_iaf_cycle3_antarctic_tracers/. Her re-run matches my original up to run 590, but not for run 591 and later.
Note that @adele157 has 2 sets of commits for runs 587-610 on branch 01deg_jra55v140_iaf_cycle3_antarctic_tracers. Ignore the first set - they had the wrong timestep.
Differences in md5 hashes in manifests/restart.yaml indicate bitwise differences in the restarts. For some reason ocean_barotropic.res.nc md5 hashes never match, but presumably this is harmless if the other restarts match.
Relevant commits (01deg_jra55v140_iaf_cycle3..01deg_jra55v140_iaf_cycle3_antarctic_tracers) are:
git diff -U0 0ab9c24..dcffbd6 manifests/restart.yaml | grep -B 5 md5 | less : md5 differences only in barotropic and tracer-related ocean restarts
git diff -U0 d09ec5e..942fb38 manifests/restart.yaml | grep -B 5 md5 | less : md5 differences in lots of ocean and ice restarts (presumably all of them?)
So it would seem that something different happened in run 590 so the restarts used by 591 differ. @adele157 re-ran her 590 and the result was the same as her previous run. So that seems to indicate something strange happened with my run 590 (0ab9c24). I can't see anything suspicious for run 590 in my run summary. There are changes to manifests/input.yaml in my runs 588 and 589, but they don't seem relevant.