@golaz our nightly testing has revealed a similar problem. We think it was introduced this week and are currently tracking it down (on Slack).
@rljacob - good to hear that this was caught and is being tracked down.
We found the source of our testing problem (PR #1272). It has been reverted from master, which will itself change answers (that was a non-BFB PR). Hopefully that also solves your problem, but we're not sure, so please try again with the latest version of master.
Thanks, @rljacob and @bishtgautam. Trying the latest version now.
@rljacob : I tried again with (beea721), and unfortunately the results are still not BFB when comparing 133, 143 and 265 nodes.
Ok. What was the last version of master where this worked for you?
Would it be appropriate to try this with DEBUG?
According to my notes, the last time I checked was with b1c676f40 and it worked. But I don't routinely check, as I was assuming this was part of the standard ACME testing procedure.
Based on redsky testing of master (started 2017-04-28 03:52:49), ERP_Ld3.ne30_oEC.A_WCYCL2000.redsky_intel passed.
So there must be something subtle such that A_WCYCL1850S at resolution ne30_oECv3_ICG is not reproducible, while A_WCYCL2000 at ne30_oEC is.
This is helpful. It's unlikely that this is because of 2000 vs 1850 forcing, so most likely it is due to the use of spun-up ocean and sea-ice initial conditions.
@golaz, I also see this in my PE layout experiments (now that I look). In particular, only changing the number of OCN processes was sufficient.
That would suggest the code that does the parallel read/initialization of the spun-up ICs for the ocean may be the issue.
Tagging @mark-petersen
@worleyph - can you check and see if the test fails for A_WCYCL1850 and ne30_oEC60to30v3? That would definitely point to something in reading the initial condition files...
On Titan, with current master and the Intel compiler, using
-compset A_WCYCL2000 -res ne30_oEC
and MPI-only PE layouts that differ only in the number of MPI processes in the OCN (512 -> 256), the output in atm.log differs starting at 'nstep, te 5'.
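(For anyone who wants to repeat this check, here is a minimal sketch of how the divergence point can be located; the run directories are placeholders and the logs are assumed to be uncompressed:)

```bash
# Placeholder run directories for the two PE layouts
RUN_512=/path/to/run_512x1
RUN_256=/path/to/run_256x1

# Pull out the per-step total-energy diagnostic lines and diff them;
# the first differing line shows the step at which the runs diverge.
grep "nstep, te" $RUN_512/atm.log.* > te_512.txt
grep "nstep, te" $RUN_256/atm.log.* > te_256.txt
diff te_512.txt te_256.txt | head
```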
That's a compset/res we test all the time.
I'm trying
./create_test PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S
./create_test ERP_Ld3.ne30_oEC.A_WCYCL2000
My version of master had a "bug fix" in ocn_comp_mct.F and ice_comp_mct.F:
300c300
< call MPAS_io_set_iotype(domain % iocontext, pio_iotype)
---
> !pw call MPAS_io_set_iotype(domain % iocontext, pio_iotype)
but I also found other recent A_WCYCL cases that show nonreproducible results when changing process count in OCN (and don't have this change).
ERP_Ld3.ne30_oEC.A_WCYCL2000.redsky_intel passed with hash c9903bde190 from master. ERP is supposed to change the MPI tasks in all components in the middle of a restart and test that results are BFB.
My tests all have OCN on its own nodes (as does @golaz 's experiments). I am building a job with components stacked, to see whether this makes a difference.
This issue isn't involved right? https://github.com/ESMCI/cime/issues/1433
Doubtful.
Pat, try your test with c9903bd. That version had a passing ERP test on redsky but it failed for Chris.
On Redsky, the ERP_Ld3.ne30_oEC.A_WCYCL2000 test that passed starts with everything stacked on 512 tasks. It then halves them to 256. That should have found this bug if it were present in that compset/resolution.
Not necessarily. My experiment only changed OCN, so if it is the ratio of CPL to OCN processes that matters, then the stacked experiment would not exercise this.
Pat, what exactly are you running? It must be something smaller/shorter than what I'm trying, which is those tests (265 nodes, still in the queue). I did run 5 days of the run_acme script that Chris posted at the top. It worked. And now I should change the PE layout and run again to see if it changes results?
Was also going to try running debug with Intel v17 to see if it catches anything obvious.
Adding here that Rob suggests trying REP_Ln9.ne11_oQU240.A_WCYCL1850. If that passes, try REP_Ln9.ne30_oECv3_ICG.A_WCYCL1850S. REP does 2 identical initial runs.
I am running on Titan. This is not Edison-specific. I am running for 1 day.
> And now I should change the PE layout and run again to see if it changes results?
Yes, though since I can reproduce the problem directly, without going through the script, perhaps this part of the exercise is not necessary at this point. How you change the PE layout apparently matters? I changed only the number of OCN tasks.
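(For reference, a minimal sketch of making that kind of change in a CIME case directory; the path, task count, and exact re-setup/rebuild steps are assumptions and depend on the CIME version in this master:)

```bash
cd /path/to/case_scripts        # placeholder case directory
./xmlquery NTASKS_OCN           # inspect the current OCN task count
./xmlchange NTASKS_OCN=256      # change only the ocean's MPI task count
# The case then has to be re-set-up, rebuilt, and resubmitted;
# the exact commands vary with the CIME version.
./case.setup
./case.build
./case.submit
```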
@rljacob , my 'stacked' experiment was also not reproducible: went from 512x1 to 480x1. I am trying 256x1 next.
Thanks. Which machine/compiler are you using?
Nevermind. Titan.
and Intel compiler.
Which version of intel?
intel/15.0.2.164 (the current default on Titan)
Just trying to follow along -- is everything consistent with the hypothesis that reading in a spun-up ocean (designated by S at end of the compset) is the part that isn't reproducible for different number of PEs?
Don't think so - I am using a generic A_WCYCL2000 compset.
Could be that 2 bugs are being chased. Really surprising that Pat is seeing this when ERP_Ld3.ne30_oEC.A_WCYCL2000 passed recently on redsky.
It is even different comparing 256x1 with 512x1 - which is what ERP_Ld3.ne30_oEC.A_WCYCL2000 should be doing? I'll try c9903bd. What process counts are used in the ERP_Ld3.ne30_oEC.A_WCYCL2000 test?
According to the test output (http://my.cdash.org/testDetails.php?test=31255751&build=1185126) it starts with 512. Since "Ld3" is added, it should run for 3 days with 512 and output a restart at day 2. Then pick up that restart with a 256-proc executable and run for 1 day. Then compare output in coupler history files from the end of each run.
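(For anyone following along, the final comparison in these system tests essentially comes down to running cprnc, the netCDF field-comparison tool shipped with CIME, on the two coupler history files; a sketch with placeholder paths and file names:)

```bash
# Placeholder paths; a BFB pair of runs should report zero differing fields.
CPRNC=/path/to/cime/tools/cprnc/cprnc
$CPRNC base_run.cpl.hi.0001-01-04-00000.nc \
       rest_run.cpl.hi.0001-01-04-00000.nc | tail -n 20
```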
I wish that hypothesis made more sense to me, but I think reading in the spun-up conditions uses the same code as reading in a restart file. I have experiments in mind that I'll try today and tomorrow and see if I can pin it down.
@rljacob , haven't we run into this before? A nonreproducibility bug that only showed up when comparing two initial runs, and not from a checkpoint? Not that we have verified this, but this sure sounds familiar.
Indeed we did. We don't have an explicit test for that in the suite, but that kind of problem would create an unexpected diff with baselines.
Unless it snuck in with a non-BFB change....
Everyone try this test: REP_Ln9.ne11_oQU240.A_WCYCL1850
It's small and short. If it passes, try REP_Ld3.ne30_oECv3_ICG.A_WCYCL1850S
I'm stuck in my current tests at the moment (and most of my cost is the compilation). Again, I see differences in the 5th timestep in the atm.log file, so a very short run may be sufficient.
I have another hypothesis. We merged a PR at the end of February, #1291, that changed restart fields for the ocean. But I checked, and the IC file for oEC60to30v3 is older than that and doesn't have the new fields. It could be that whatever the ocean is doing in that situation is not consistent across processor counts. I have another set of ICG files I can point to, to see if this is responsible.
Having flashbacks here - I did just now check that two runs with the same layout are the same (through 1 day).
Repeated the experiments (512x1 and 256x1) using c9903bd and the Intel compiler. Same results - atm.log diverge starting with nstep, te 5. Then tried PGI, and it showed divergence starting with nstep, te 2 ... that was a surprise.
Ran 512x1 for all components except for OCN, where I used 256x1. This is BFB with the run with all components using 256x1 (for Intel).
Tried the same experiment with PGI (512x1 for all components except OCN, which used 256x1):
a) It was NOT the same as the all-256x1 job (unlike for Intel).
b) It was the same as the all-512x1 job until nstep 5, so the same as the Intel result.
Appears that there are at least two issues, one with OCN (PGI and Intel), and one PGI only.
I tried v1.0.0-beta.1 (00a38722dbce8eaefa690669c1d98bdd11d56154) and it has the same behavior.
I'm afraid I have some bad news to report (or maybe I did something wrong; always a possibility).
I ran some tests with a recent version of master (c9903bde) using compset A_WCYCL1850S and resolution ne30_oECv3_ICG on Edison. I tried the new PE layouts provided by @jonbob for 143 and 265 nodes. Both 5-day tests ran successfully; unfortunately, the results diverge between the two simulations after a few time steps (based on the atm.log files). I verified that BFBFLAG is set to true, so my understanding is that I should get the same results.
Here is the run_acme script: run_acme.20170426.beta1_05.A_WCYCL1850S.ne30_oECv3_ICG.edison
Output on Edison is under:
/global/cscratch1/sd/golaz/ACME_simulations/20170426.beta1_05.A_WCYCL1850S.ne30_oECv3_ICG.edison/test???
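(For reference, BFBFLAG is a CIME XML variable in env_run.xml that makes the coupler use reproducible sums; a minimal sketch of checking it from a case directory, with a placeholder path rather than the actual test??? directories above:)

```bash
cd /path/to/case_scripts   # placeholder case directory
./xmlquery BFBFLAG         # should report TRUE for reproducible coupler sums
grep -i bfbflag env_run.xml
```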