
Current master (c9903bde) not BFB on Edison: 143 vs 265 nodes #1467

Closed golaz closed 7 years ago

golaz commented 7 years ago

I'm afraid I have some bad news to report (or maybe I did something wrong; always a possibility).

I ran some tests with a recent version of master (c9903bde) using compset A_WCYCL1850S and resolution ne30_oECv3_ICG on Edison. I tried the new PE layouts provided by @jonbob for 143 and 265 nodes. Both 5-day tests ran successfully; unfortunately, the results diverge between the two simulations after a few time steps (based on the atm.log files). I verified that BFBFLAG is set to true, so my understanding is that I should get the same results.

Here is the run_acme script: run_acme.20170426.beta1_05.A_WCYCL1850S.ne30_oECv3_ICG.edison

Output on Edison is under: /global/cscratch1/sd/golaz/ACME_simulations/20170426.beta1_05.A_WCYCL1850S.ne30_oECv3_ICG.edison/test???
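
(For anyone re-checking this: a minimal sketch of how to confirm the flag and locate the divergence, assuming the standard CIME xmlquery tool and the usual "nstep, te" lines in atm.log; the case and run directory names below are placeholders.)

    # in each case directory, confirm the coupler BFB flag is on
    ./xmlquery BFBFLAG

    # extract the per-step total-energy lines from the two runs and compare
    grep "nstep, te" test143/atm.log.* > te_143.txt
    grep "nstep, te" test265/atm.log.* > te_265.txt
    diff te_143.txt te_265.txt | head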

rljacob commented 7 years ago

@golaz our nightly testing has revealed a similar problem. We think it was introduced this week and are currently tracking it down (on Slack).

golaz commented 7 years ago

@rljacob - good to hear that this was caught and is being tracked down.

rljacob commented 7 years ago

We found the source of our testing problem (PR #1272). It has been removed from master, which will itself change answers (that was a non-BFB PR). Hopefully that also solves your problem, but we're not sure, so please try again with the latest version of master.

golaz commented 7 years ago

Thanks, @rljacob and @bishtgautam. Trying now the latest version.

golaz commented 7 years ago

@rljacob : I tried again with (beea721), and unfortunately the results are still not BFB when comparing 133, 143 and 265 nodes.

rljacob commented 7 years ago

Ok. What was the last version of master where this worked for you?

ndkeen commented 7 years ago

Would it be appropriate to try this with DEBUG?

golaz commented 7 years ago

According to my notes, the last time I checked was with b1c676f40 and it worked. But I don't routinely check, as I was assuming this was part of the standard ACME testing procedure.

mt5555 commented 7 years ago

Based on redsky testing of master, started 2017-04-28 03:52:49, ERP_Ld3.ne30_oEC.A_WCYCL2000.redsky_intel passed.

So there must be something subtle about why A_WCYCL1850S with resolution ne30_oECv3_ICG is not reproducible while A_WCYCL2000 with ne30_oEC is.

golaz commented 7 years ago

This is helpful. It's unlikely that this is because of 2000 vs 1850 forcing, so most likely it is due to the use of spun-up ocean and sea-ice initial conditions.

worleyph commented 7 years ago

@golaz, I also see this in my PE layout experiments (now that I look). In particular, changing only the number of OCN processes was sufficient.

rljacob commented 7 years ago

That would suggest the code that does the parallel read/init of the spun-up IC's for the ocean may be an issue.

rljacob commented 7 years ago

Tagging @mark-petersen

jonbob commented 7 years ago

@worleyph - can you check and see if the test fails for A_WCYCL1850 and ne30_oEC60to30v3? That would definitely point to something in reading the initial condition files...

worleyph commented 7 years ago

On titan, with current master, with Intel compiler,

 -compset A_WCYCL2000 -res ne30_oEC

and MPI-only PE layouts that differ only in the number of MPI processes in the OCN (512 -> 256), output in atm.log differs starting at 'nstep, te 5'.
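
(For reference, a sketch of that kind of layout change using the standard CIME xmlchange interface; exact command names may differ for this vintage of the scripts, and the task count is just the example from above.)

    # in the case directory: shrink only the ocean's MPI task count, then rebuild
    ./xmlchange NTASKS_OCN=256
    ./case.setup --reset
    ./case.build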

rljacob commented 7 years ago

That's a compset/res we test all the time.

ndkeen commented 7 years ago

I'm trying

./create_test PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S
./create_test ERP_Ld3.ne30_oEC.A_WCYCL2000

worleyph commented 7 years ago

My version of master had a "bug fix" in ocn_comp_mct.F and ice_comp_mct.F:

 300c300
 <     call MPAS_io_set_iotype(domain % iocontext, pio_iotype)
 ---
 > !pw    call MPAS_io_set_iotype(domain % iocontext, pio_iotype)

but I also found other recent A_WCYCL cases that show nonreproducible results when changing process count in OCN (and don't have this change).

rljacob commented 7 years ago

ERP_Ld3.ne30_oEC.A_WCYCL2000.redsky_intel passed with hash c9903bde190 from master. ERP is supposed to change the MPI task counts in all components in the middle of a restart and test for BFB.

worleyph commented 7 years ago

My tests all have OCN on its own nodes (as do @golaz's experiments). I am building a job with the components stacked, to see whether this makes a difference.

ndkeen commented 7 years ago

This issue isn't involved right? https://github.com/ESMCI/cime/issues/1433

rljacob commented 7 years ago

Doubtful.

rljacob commented 7 years ago

Pat, try your test with c9903bd. That version had a passing ERP test on redsky but it failed for Chris.

rljacob commented 7 years ago

On Redsky, the ERP_Ld3.ne30_oEC.A_WCYCL2000 test that passed starts with everything stacked on 512 tasks. It then halves them to 256. That should find this bug if it was present in that compset/resolution.

worleyph commented 7 years ago

Not necessarily. My experiment only changed OCN, so if it is the ratio of CPL to OCN processes that matters, then the stacked experiment would not exercise this.

ndkeen commented 7 years ago

Pat, what exactly are you running? It must be something smaller/shorter than what I'm trying, which is those tests (265 nodes, still in the queue). I did run 5 days of the run_acme script that Chris posted at the top, and it worked. Now should I change the PE layout and run again to see if it changes the results?

I was also going to try running with DEBUG and Intel v17 to see if it catches anything obvious.

Adding here that Rob suggests trying REP_Ln9.ne11_oQU240.A_WCYCL1850. If that passes, try REP_Ln9.ne30_oECv3_ICG.A_WCYCL1850S. REP does two identical initial runs.

worleyph commented 7 years ago

I am running on Titan. This is not Edison-specific. I am running for 1 day.

Regarding "And now I should change the pe layout, run again to see if it changes results?":

Yes, though since I can reproduce it directly without going through the script, perhaps that part of the exercise is not necessary at this point. How you change the PE layout apparently matters; I changed only the number of OCN tasks.

worleyph commented 7 years ago

@rljacob, my 'stacked' experiment was also not reproducible: it went from 512x1 to 480x1. I am trying 256x1 next.

rljacob commented 7 years ago

Thanks. Which machine/compiler are you using?

rljacob commented 7 years ago

Nevermind. Titan.

worleyph commented 7 years ago

and Intel compiler.

rljacob commented 7 years ago

Which version of intel?

worleyph commented 7 years ago

intel/15.0.2.164 (the current default on Titan)

agsalin commented 7 years ago

Just trying to follow along -- is everything consistent with the hypothesis that reading in a spun-up ocean (designated by S at end of the compset) is the part that isn't reproducible for different number of PEs?

worleyph commented 7 years ago

Don't think so - I am using a generic A_WCYCL2000 compset.

rljacob commented 7 years ago

It could be that two bugs are being chased. It's really surprising that Pat is seeing this when ERP_Ld3.ne30_oEC.A_WCYCL2000 passed recently on redsky.

worleyph commented 7 years ago

It is even different comparing 256x1 with 512x1 - which is what ERP_Ld3.ne30_oEC.A_WCYCL2000 should be doing? I'll try c9903bd. What process counts are used in the ERP_Ld3.ne30_oEC.A_WCYCL2000 test?

rljacob commented 7 years ago

According to the test output (http://my.cdash.org/testDetails.php?test=31255751&build=1185126) it starts with 512. Since "Ld3" is added, it should run for 3 days with 512 and output a restart at day 2. Then pick up that restart with a 256-proc executable and run for 1 day. Then compare output in coupler history files from the end of each run.
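
(A rough manual equivalent of that comparison, assuming the cprnc tool that the CIME tests use for history-file comparison; the run directories and coupler history file names below are placeholders.)

    # cprnc reports whether two netCDF history files are bit-for-bit identical
    cprnc run_512/case512.cpl.hi.0001-01-04-00000.nc run_256/case256.cpl.hi.0001-01-04-00000.nc | tail -20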

jonbob commented 7 years ago

I wish that hypothesis made more sense to me, but I think reading in the spun-up conditions uses the same code as reading in a restart file. I have experiments in mind that I'll try today and tomorrow and see if I can pin it down.


worleyph commented 7 years ago

@rljacob , haven't we run into this before? A nonreproducibility bug that only showed up when comparing two initial runs, and not from a checkpoint? Not that we have verified this, but this sure sounds familiar.

rljacob commented 7 years ago

Indeed we did. We don't have an explicit test for that in the suite, but that kind of problem would create an unexpected diff with baselines.

Unless it snuck in with a non-BFB change....

rljacob commented 7 years ago

Everyone try this test: REP_Ln9.ne11_oQU240.A_WCYCL1850

rljacob commented 7 years ago

Small and short. If it passes, try REP_Ld3.ne30_oECv3_ICG.A_WCYCL1850S.
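
(Following the same create_test pattern used earlier in the thread; add your machine/compiler suffix as needed.)

    ./create_test REP_Ln9.ne11_oQU240.A_WCYCL1850
    # if that passes:
    ./create_test REP_Ld3.ne30_oECv3_ICG.A_WCYCL1850S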

worleyph commented 7 years ago

I'm stuck in my current tests at the moment (and most of my cost is the compilation). Again, I see differences in the 5th timestep in the atm.log file, so a very short run may be sufficient.

jonbob commented 7 years ago

I have another hypothesis. We merged a PR at the end of February, #1291, that changed restart fields for the ocn. But I checked, and the IC file for oEC60to30v3 is older than that and doesn't have the new fields. It could be that whatever the ocean is doing in that situation is not consistent across processor counts. I have another set of ICG files I can point to, to see if this is responsible.
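
(One quick way to check an IC file for those fields is to dump its header; the file name and field name below are hypothetical placeholders.)

    # list the variables in the ocean IC file and look for the fields added by #1291
    ncdump -h oEC60to30v3_ocean_IC.nc | grep -i "<field_added_by_1291>"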

worleyph commented 7 years ago

Having flashbacks here - I did just now check that two runs with the same layout are the same (through 1 day).

worleyph commented 7 years ago

Repeated the experiments (512x1 and 256x1) using c9903bd and the Intel compiler. Same results: the atm.log files diverge starting with 'nstep, te 5'. Then I tried PGI, and it showed divergence starting with 'nstep, te 2'... that was a surprise.

worleyph commented 7 years ago

Ran 512x1 for all components except for OCN, where I used 256x1. This is BFB with the run with all components using 256x1 (for Intel).

worleyph commented 7 years ago

Tried the same experiment with PGI (ran 512x1 for all components except OCN, which used 256x1).

a) This was NOT the same as the 256x1 job (unlike for Intel). b) It was the same as all-512x1 until nstep 5, so the same as the Intel result.

It appears that there are at least two issues: one with OCN (PGI and Intel), and one PGI-only.

worleyph commented 7 years ago

I tried v1.0.0-beta.1 (00a38722dbce8eaefa690669c1d98bdd11d56154) and it has the same behavior.