E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Potential non-bfb issue with scream master branch on Frontier at ne1024 #2685

Open ndkeen opened 8 months ago

ndkeen commented 8 months ago

Since the work to get scream master working on Frontier, I've been testing out various cases. I noticed that some cases were not always BFB. So far, I have not found any ne30 or ne256 cases to be non-bfb. And most ne1024 are indeed BFB as I try different things (such as node count, output, etc).

It was suggested to turn on more verbose hash checking in two cases. I first tried this with two identical cases at 320 nodes; while the cases ran out of time (adding this hash logging is more expensive), they were BFB until the end. But then I tried a 384-node case and I see a diff somewhere between step 0 and step 1.

1205c1269
<    0: exxhash> 2019-212.00000 1 a4c8e8fbac0bfa70 (spa-pst-sc-0)
---
>    0: exxhash> 2019-212.00000 1 a4c8e8fbac0a0de4 (spa-pst-sc-0)
1207,1209c1271,1273
<    0: exxhash> 2019-212.00000 0 12b02b356ba2fcff (p3-pre-sc-0)
<    0: exxhash> 2019-212.00000 0 f6c5e8dc5399ba10 (p3-pst-sc-0)
<    0: exxhash> 2019-212.00000 1 f8725316e5ec6dce (p3-pst-sc-0)
---
>    0: exxhash> 2019-212.00000 0 12b02b356ba31aea (p3-pre-sc-0)
>    0: exxhash> 2019-212.00000 0 f6c5e8dc5399c7c4 (p3-pst-sc-0)
>    0: exxhash> 2019-212.00000 1 f8725316e5ebb7ff (p3-pst-sc-0)
1211,1212c1275,1276
<    0: exxhash> 2019-212.00000 0 1539437cbe0ad944 (mac_aero_mic-pst-sc-0)
<    0: exxhash> 2019-212.00000 1  ea47f38f11906d5 (mac_aero_mic-pst-sc-0)
---
>    0: exxhash> 2019-212.00000 0 1539437cbe0ab886 (mac_aero_mic-pst-sc-0)
>    0: exxhash> 2019-212.00000 1  ea47f38f11654a1 (mac_aero_mic-pst-sc-0)

To get more verbose hashing:

    ./atmchange BfbHash=1
    ./atmchange --all internal_diagnostics_level=1 atmosphere_processes::internal_diagnostics_level=1

The two case directories:

/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/bartgol_eamxx_share-horiz-remap-data/tcess-control.ne1024pg2_ne1024pg2.F2010-SCREAMv1.bartgol_eamxx_share-horiz-remap-data.n0320t08x8.nr.nohist.odef.S0.cfix.ndag.hh

/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/bartgol_eamxx_share-horiz-remap-data/tcess-control.ne1024pg2_ne1024pg2.F2010-SCREAMv1.bartgol_eamxx_share-horiz-remap-data.n0384t08x8.nr.nohist.odef.S0.cfix.ndag.hh
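When comparing two runs like this, the first divergent hash line is the interesting one, since everything downstream of it is contaminated. As a sketch, a small hypothetical helper (not part of the EAMxx tooling) that parses `exxhash>` lines of the form shown above and reports the first tag whose hash differs between two logs might look like:

```python
import re

# Matches lines like:
#    0: exxhash> 2019-212.00000 1 a4c8e8fbac0bfa70 (spa-pst-sc-0)
HASH_RE = re.compile(r"exxhash>\s+(\S+)\s+(\d+)\s+([0-9a-f]+)\s+\((\S+)\)")

def hash_records(lines):
    """Extract (timestamp, column, tag, hash) tuples from log lines."""
    out = []
    for line in lines:
        m = HASH_RE.search(line)
        if m:
            date, col, h, tag = m.groups()
            out.append((date, col, tag, h))
    return out

def first_divergence(log_a, log_b):
    """Return the first pair of records whose metadata match but hashes differ."""
    for ra, rb in zip(hash_records(log_a), hash_records(log_b)):
        if ra[:3] == rb[:3] and ra[3] != rb[3]:
            return ra, rb
    return None

# Example with the two spa-pst-sc-0 lines from the diff above:
a = ["   0: exxhash> 2019-212.00000 1 a4c8e8fbac0bfa70 (spa-pst-sc-0)"]
b = ["   0: exxhash> 2019-212.00000 1 a4c8e8fbac0a0de4 (spa-pst-sc-0)"]
print(first_divergence(a, b))  # reports the spa-pst-sc-0 mismatch
```

This is just a convenience over eyeballing the `diff` output; the exxhash line format is assumed from the excerpts in this issue.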

Andrew B notes:

The diff points to SPA. 'pst' means "post", so 'pre' is fine and 'pst' is bad, meaning SPA is the issue. `sc-N` is subcycle N.
ndkeen commented 8 months ago

The above was with bartgol/eamxx/share-horiz-remap-data checked out on Jan 24. Trying the same two setups with a scream master from Jan 23, I get the same diffs -- i.e., it diverges at the same place with the same values. Which is maybe not surprising, but comforting.

<    0: exxhash> 2019-212.00000 1 a4c8e8fbac0bfa70 (spa-pst-sc-0)
---
>    0: exxhash> 2019-212.00000 1 a4c8e8fbac0a0de4 (spa-pst-sc-0)

I also ran a couple of 2-month-long ne30 cases, each with a different number of nodes; both are BFB. Same for multiple 5-day cases at ne256 -- all BFB.

ndkeen commented 8 months ago

I had incorrectly assumed that a previous checkout was BFB. There are indeed non-bfb diffs in its hash prints, but only 3 of them -- which seems odd. How is it possible to be non-bfb at one point and then "recover" and be BFB for the remaining steps? These are the only diffs in the file after 18 steps.

<    0: exxhash> 2019-212.00000 1 bb91d5d1f501b850 (cosp-pst-sc-0)
---
>    0: exxhash> 2019-212.00000 1 d08d55ade701b6a6 (cosp-pst-sc-0)
1287c1223
<    0: exxhash> 2019-212.00000 1 2f8248e7f33f554f (physics-pst-sc-0)
---
>    0: exxhash> 2019-212.00000 1 447dc8c3e53f53a5 (physics-pst-sc-0)
1294c1230
<    0: exxhash> 2019-212.00000 1 a6abe3f663eb3313 (EAMxx-pst-sc-0)
---
>    0: exxhash> 2019-212.00000 1 bba763d255eb3169 (EAMxx-pst-sc-0)

So that's the diff from 2 cases using a Jan 17th checkout. I see the same behavior with a checkout of b2024-01-18-PR2668-6167f97fee -- i.e., Jan 19th, just after PR #2668.

Additionally, I think there may be a different flavor of non-bfb, as I still have cases that are non-bfb with the Jan 17th checkout and even before. However, I don't think they are as easy to trip as the one above.

If it makes sense to look at changes between the two checkout dates, here are some PRs:

0943b6a57c 2024-01-23 13:02:00 -0700 Merge pull request #2655 from E3SM-Project/oksanaguba/eamxx/expose2
4c1ce5cf31 2024-01-23 10:33:13 -0700 Merge pull request #2646 from E3SM-Project/bartgol/eamxx/spa-use-horiz-interp-remapper
6167f97fee 2024-01-18 18:05:36 -0700 Merge pull request #2668 from E3SM-Project/bartgol/io/avg-cnt-updates
9f3b148d76 2024-01-15 10:15:05 -0700 Merge Pull Request #2673 from E3SM-Project/scream/mahf708/nudging/fix-grid-error
8f126455ff 2024-01-15 10:13:52 -0700 Merge Pull Request #2672 from E3SM-Project/scream/2671-scorpio-interface-doesnt-seem-to-handle-reading-scalar0d-fields
98ba276cad 2024-01-15 10:12:39 -0700 Merge Pull Request #2670 from E3SM-Project/scream/2669-is_valid_layout-logic-cannot-handle-scalar0d-in-practice
cc36493037 2024-01-11 13:06:15 -0700 Merge Pull Request #2665 from E3SM-Project/scream/bartgol/fix-avg-cnt-in-io
1fa8a1b738 2024-01-11 13:05:02 -0700 Merge Pull Request #2664 from E3SM-Project/scream/bartgol/fix-accumulated-fields-for-io
9a1e219580 2024-01-10 14:52:51 -0700 Merge Pull Request #2660 from E3SM-Project/scream/bartgol/field-layout-ctor-bugfix
32d421f615 2024-01-10 13:40:47 -0700 Merge Pull Request #2631 from E3SM-Project/scream/tcclevenger/perturb_field_util
64114dd1d5 2024-01-10 10:34:13 -0700 Merge Pull Request #2659 from E3SM-Project/scream/bartgol/eamxx/property-check-perf-fix
f2f217807a 2024-01-09 10:54:23 -0700 Merge Pull Request #2647 from E3SM-Project/scream/bartgol/grid-cache-is-unique
e7b9f5f9a8 2024-01-09 09:25:54 -0700 Merge pull request #2654 from E3SM-Project/jgfouca/frontier_fixes
2f9def864a 2024-01-02 12:39:15 -0700 Merge pull request #2649 from E3SM-Project/ambrad/eamxx/fmad-adjust
c5b7c31c76 2023-12-29 14:22:30 -0700 Merge pull request #2648 from E3SM-Project/ambrad/eamxx/pm-gpu-fmad-try
bab860e409 2023-12-22 13:19:18 -0700 Merge Pull Request #2642 from E3SM-Project/scream/bartgol/eamxx/fix-vertical-remap-constructor
4d09464c67 2023-12-22 12:41:26 -0700 Merge Pull Request #2644 from E3SM-Project/scream/bartgol/field-from-pre-existing-view
4989dad376 2023-12-20 16:12:41 -0700 Merge Pull Request #2643 from E3SM-Project/scream/elynn/enable-ruby-ML-run
35d8f38815 2023-12-20 08:52:45 -0700 Merge pull request #2636 from E3SM-Project/bartgol/active-gases-pg2-and-restart-fixes
e901950465 2023-12-19 13:54:59 -0700 Merge pull request #2640 from E3SM-Project/bartgol/eamxx/namelist-defaults-append-keyword
e5d290c4f1 2023-12-18 16:11:30 -0700 Merge Pull Request #2641 from E3SM-Project/scream/ambrad/eamxx/ascent-fmad-workaround
ambrad commented 8 months ago

How is it possible to be non-bfb at one point and then "recover" and be BFB for the remaining steps? These are the only diffs in the file after 18 steps.

It looks like COSP might be nondeterministic. However, (1) its outputs are purely diagnostic, so they can't affect the simulation and (2) it appears only once, suggesting uninitialized fields in COSP are the issue, not persistent nondeterminism. These two together probably answer your question.

bartgol commented 8 months ago

@ndkeen I don't know anything about COSP, so I can't tell if that's related. But your first msg showed SPA being part of the problem. I verified SPA was also the issue in our nightly PEM tests. I'm relatively confident PR #2691 will fix PES variability in SPA. Assuming COSP is just a red herring, that PR may also fix this issue. Either way, you may want to give that PR fix a try.

ndkeen commented 8 months ago

Adding that 3-line sort fix Luca found above does seem to help here. When I run the same 18-step tests of 2 cases (320 and 384 nodes on Frontier), I no longer see all of those non-bfb hashes. However, still oddly, we see 3 non-bfb values:

<    0: exxhash> 2019-212.00000 1 bb91d5d1f501b850 (cosp-pst-sc-0)
---
>    0: exxhash> 2019-212.00000 1 d08d55ade701b6a6 (cosp-pst-sc-0)
1287c1223
<    0: exxhash> 2019-212.00000 1 2f8248e7f33f554f (physics-pst-sc-0)
---
>    0: exxhash> 2019-212.00000 1 447dc8c3e53f53a5 (physics-pst-sc-0)
1294c1230
<    0: exxhash> 2019-212.00000 1 a6abe3f663eb3313 (EAMxx-pst-sc-0)
---
>    0: exxhash> 2019-212.00000 1 bba763d255eb3169 (EAMxx-pst-sc-0)

I think these were there before as well.
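As background on why a sort can matter here (this is an illustration of the general mechanism, not the actual PR #2691 change): floating-point addition is not associative, so reducing the same values in a different order -- a different PES layout, or an unsorted index list in a remapper -- can flip low-order bits, which is enough to change a bitwise hash. A minimal Python demonstration:

```python
# Same four values, two summation orders. Floating-point addition is not
# associative, so the results can differ in the low-order bits (here, visibly).
vals = [0.1, 1e16, -1e16, 0.1]

left_to_right = sum(vals)          # ((0.1 + 1e16) - 1e16) + 0.1
sorted_order = sum(sorted(vals))   # most negative value first

print(left_to_right == sorted_order)  # False: same data, different result
```

Fixing the traversal order (e.g. sorting indices before accumulating) makes the reduction order independent of the decomposition, which is the usual way to restore BFB behavior across PES layouts.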

bartgol commented 8 months ago
       seed(:)=0
       seed = int(cospstateIN%phalf(:,Nlevels+1))  ! In case of NPoints=1
       ! *NOTE* Chunking will change the seed
       if (NPoints .gt. 1) seed=int((cospstateIN%phalf(:,Nlevels+1)-minval(cospstateIN%phalf(:,Nlevels+1)))/      &
            (maxval(cospstateIN%phalf(:,Nlevels+1))-minval(cospstateIN%phalf(:,Nlevels+1)))*100000) + 1
       call init_rng(rngs, seed)

@brhillman It seems to me that COSP here is setting a seed for the rng based on the min/max value of some arrays. This may be the cause of the non-bfbness, since those min/max values are computed over the chunk-local data and are therefore decomposition-dependent.

Is there a reason why we use the array entries to pick a seed? Maybe we can pass it in via an extra param, so that in unit tests we can pass a seed that is NTASKS-independent...
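The decomposition-dependence is easy to see: the Fortran above normalizes each column's bottom-interface pressure by the chunk-local min/max, so splitting the same columns across more ranks changes every seed (the `*NOTE* Chunking will change the seed` comment says as much). A rough Python transcription of that formula, with made-up pressure values:

```python
def cosp_seeds(phalf_bottom):
    """Mimic the COSP seed formula above: normalize each column's
    bottom-level pressure by the chunk-local min/max, scale, add 1."""
    lo, hi = min(phalf_bottom), max(phalf_bottom)
    return [int((p - lo) / (hi - lo) * 100000) + 1 for p in phalf_bottom]

cols = [1000.0, 990.0, 980.0, 970.0]  # same physical columns in both layouts

one_chunk = cosp_seeds(cols)                              # one rank owns all 4
two_chunks = cosp_seeds(cols[:2]) + cosp_seeds(cols[2:])  # split across 2 ranks

print(one_chunk)   # [100001, 66667, 33334, 1]
print(two_chunks)  # [100001, 1, 100001, 1]
```

The same physical columns get different seeds purely because the chunk boundaries moved, so any rng-driven output downstream (and hence its hash) changes with the PES layout.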