E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

FC5AV1C-04P2 is non-BFB with threading turned on #1203

Closed singhbalwinder closed 7 years ago

singhbalwinder commented 7 years ago

@wlin7 recently identified that the model is non-BFB when run with threading turned on. He has reproduced this problem on Edison, Anvil, and Cori with the ne30 and ne120 grids. I (Balwinder) was also able to reproduce it on Eos with the ne16 and ne30 grids. To reproduce this, I ran a smoke test on Eos:

SMS_Ln5.ne16_ne16.FC5AV1C-04P2

Resubmitting this test (./case.submit) produces non-BFB results when compared against the initial run.
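
For anyone trying to reproduce this, a minimal sketch of the workflow (the create_test location, options, and the case-directory name are assumptions and will differ by checkout and machine):

    # Build and run the smoke test once
    ./create_test SMS_Ln5.ne16_ne16.FC5AV1C-04P2

    # After the first run completes, resubmit the same case from its case directory
    # (the directory name below is illustrative)
    cd SMS_Ln5.ne16_ne16.FC5AV1C-04P2.eos_intel.<testid>
    ./case.submit

    # Each submission writes its own atm.log.* into the run directory;
    # compare the two logs to check for BFB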

singhbalwinder commented 7 years ago

I ran a few tests and identified the following line in cam/src/physics/cam/physpkg.F90 (line 997), which might be causing this:

!$OMP PARALLEL DO PRIVATE (C, phys_buffer_chunk)

singhbalwinder commented 7 years ago

@amametjanov ran some tests on Mira and found the results to be BFB. This may imply that the non-BFB behavior is only associated with the Intel compiler. I am not sure if @wlin7 has tested it with any other compiler; I have only run tests with the Intel compiler. Following is the email I got from @amametjanov:

The issue is probably isolated to the Intel compiler. I did two runs of
`-compset FC5AV1C-04P2 -res ne30_ne30 -mach mira` and compared atm.log's of the two runs; 
values were BFB: e.g.

run1:
 nstep, te       46   0.33526714831201005E+10   0.33526842034517279E+10   0.70337333877628250E-03   0.98533311649086012E+05
 nstep, te       47   0.33526160259059653E+10   0.33526285871776705E+10   0.69457809221461578E-03   0.98533311029669741E+05
 nstep, te       48   0.33525611951309304E+10   0.33525737290326490E+10   0.69306464339303055E-03   0.98533313845669283E+05
run2:
 nstep, te       46   0.33526714831201005E+10   0.33526842034517279E+10   0.70337333877628250E-03   0.98533311649086012E+05
 nstep, te       47   0.33526160259059653E+10   0.33526285871776705E+10   0.69457809221461578E-03   0.98533311029669741E+05
 nstep, te       48   0.33525611951309304E+10   0.33525737290326490E+10   0.69306464339303055E-03   0.98533313845669283E+05

Cprnc of cam.r.0001-01-02-00000.nc and cam.rh0.0001-01-02-00000.nc also shows the two runs to be 
identical. Both runs used 1024 nodes with 675 nodes and 4 MPI tasks x 16 OMP threads per node 
for the ATM.

So threaded F-cases are still BFB. :)

wlin7 commented 7 years ago

@singhbalwinder , @amametjanov , for the record, my non-BFB runs also used the Intel compiler.

worleyph commented 7 years ago

FYI - I'm comparing PGI and Intel on Titan at the moment. Two runs with the Intel compiler were not BFB. A PGI job will be submitted next.

mt5555 commented 7 years ago

I'll add some results here from Intel (v15) compiler on the SNL institutional cluster skybridge. I have not been able to reproduce the problem on this machine.

Note that @singhbalwinder isolated the problem to the physics parameterizations, so that would suggest this is not due to the initialization, and thus it should be detected by ERP tests (which compare a restart run to a full run).

ERP_Ld3_P8x4.ne4_ne4.FC5AV1C PASS
ERP_Ld3_P32x4.ne16_ne16.FC5AV1C PASS
ERP_P8x4.ne4_ne4.FC5AV1C PASS
ERP_Ld3_P32x4.ne16_ne16.FC5AV1C-04P2 PASS

SMS_Ld3_P32x4.ne16_ne16.FC5AV1C (compare against baseline - PASS)

worleyph commented 7 years ago

@mt5555 , FYI - physics parameterizations are called during initialization.

worleyph commented 7 years ago

On Titan, PGI was B4B for two identical runs, while Intel was not, using 675x4 for the ATM and 676x4 for everything else, for

 -compset FC5AV1C-04P2 -res ne30_ne30

worleyph commented 7 years ago

Note that setting BFBFLAG to TRUE did not fix the problem.

singhbalwinder commented 7 years ago

Thanks all for running these tests. So far it seems the issue only exists when running with the Intel compiler, on all machines except Skybridge. I am using intel/16.0.1.150, so it might be specific to certain compiler versions.

In my tests, the problem appears to be somewhere in the radiation code. If I comment out the physics_update call after radiation, I get BFB answers for my five-time-step runs. I am currently looking into the radiation code.

wlin7 commented 7 years ago

Hi all,

Per Shaocheng's request, I have created a Confluence page to track this problem. The problem and the main findings so far have been copied there. Since not everybody can see the GitHub notices, the main findings/updates from here will be summarized and posted on Confluence until the problem is resolved. Thanks.

ndkeen commented 7 years ago

I can test on Cori with Intel 17. Is it just one of the tests? I see SMS_Ln5.ne16_ne16.FC5AV1C-04P2. I can run that with two different threaded PE layouts, but then how do I determine whether they are BFB or not?

mt5555 commented 7 years ago

You should be able to run an ERP test, which will compare a restart run with a full run.
To use an SMS test, you would do the following:

Create a baseline:
./create_test -g SMS_Ln5.ne16_ne16.FC5AV1C-04P2

Make a new run and compare against the baseline:
./create_test -c SMS_Ln5.ne16_ne16.FC5AV1C-04P2

I believe @wlin7 isn't using the test system, but is instead just making two runs and then diffing the atm.log files.

singhbalwinder commented 7 years ago

Thanks @wlin7 for setting up the confluence page.

@ndkeen : To reproduce this bug, do the following:

Run SMS.ne16_ne16.FC5AV1C-04P2 using a PE layout that has more than one thread. This run will generate an atm.log.* file in the run directory. After the run finishes, go to the case directory of this test and issue ./case.submit again. You will get a new atm.log.* file from this second run. Compare the two files and check whether there are differences in the global stats reported in them. The global stats look like the following:

 nstep, te        2   0.33448292351395354E+10   0.33448008133551283E+10  -0.15716762386893993E-02   0.98527804155681632E+05
 nstep, te        3   0.33449462574350176E+10   0.33449131757767444E+10  -0.18293639217668925E-02   0.98527556425431263E+05
 nstep, te        4   0.33450519447836037E+10   0.33450230985236235E+10  -0.15951560136756191E-02   0.98527371879217841E+05
 nstep, te        5   0.33451305205696392E+10   0.33451101504293952E+10  -0.11264401849883554E-02   0.98527266615589455E+05
 nstep, te        6   0.33452377748902602E+10   0.33452111422874856E+10  -0.14727470926464590E-02   0.98527165389324291E+05

If you see differences in the numbers after nstep, te between the two runs, the runs are non-BFB.
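
A quick way to do this comparison (the log file names below are illustrative; the timestamp suffixes will differ, and the logs may be gzipped, in which case use zgrep):

    # In the run directory: pull the global-stats lines out of each log and compare
    grep ' nstep, te ' atm.log.170112-101530 > run1_te.txt
    grep ' nstep, te ' atm.log.170112-143055 > run2_te.txt
    diff run1_te.txt run2_te.txt && echo "BFB" || echo "non-BFB"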

worleyph commented 7 years ago

And in my runs the difference first shows up at nstep 3, so you do not need a long run to see this. I've been using 1 day, though 5 steps are probably enough.

mt5555 commented 7 years ago

I ran this test on Edison, with the ACME master from Jan 6:

ERP_Ld3_P8x4.ne4_ne4.FC5AV1C FAIL

So this confirms that the problem does show up in an ERP test, and at very low resolution (ne4).

This identical test will PASS on Skybridge. These are both Xeon systems; the main difference is the Intel compiler version.

worleyph commented 7 years ago

I tried FC5AV1C-04P (not 04P2) and the differences showed up at nstep 1. I went back and looked again, and the same was true for FC5AV1C-04P2 (not nstep 3). Sorry for the misinformation.

worleyph commented 7 years ago

@mt5555 , FYI - the Titan results are from using intel/15.0.2.164.

ndkeen commented 7 years ago

Ack, I got:

35: forrtl: severe (154): array index out of bounds
35: Image              PC                Routine            Line        Source             
35: acme.exe           0000000003DFB9E1  Unknown               Unknown  Unknown
35: acme.exe (deleted  0000000003DF9B1B  Unknown               Unknown  Unknown
35: acme.exe           0000000003DA1634  Unknown               Unknown  Unknown
35: acme.exe (deleted  0000000003DA1446  Unknown               Unknown  Unknown
35: acme.exe           0000000003D1FE86  Unknown               Unknown  Unknown
35: acme.exe (deleted  0000000003D2BA13  Unknown               Unknown  Unknown
35: acme.exe           00000000039D8D20  Unknown               Unknown  Unknown
35: acme.exe (deleted  00000000010F3451  Unknown               Unknown  Unknown
35: acme.exe           00000000010F9CC2  Unknown               Unknown  Unknown
35: acme.exe (deleted  0000000000E4971D  Unknown               Unknown  Unknown
35: acme.exe           00000000004FB6D9  cam_comp_mp_cam_r         240  cam_comp.F90
35: acme.exe           00000000004EAC4F  atm_comp_mct_mp_a         341  atm_comp_mct.F90
35: acme.exe           000000000042705E  component_mod_mp_         227  component_mod.F90
35: acme.exe           000000000041C577  cesm_comp_mod_mp_        1926  cesm_comp_mod.F90
35: acme.exe           000000000042408D  MAIN__                     62  cesm_driver.F90
35: acme.exe (deleted  000000000040A8DE  Unknown               Unknown  Unknown
35: acme.exe           0000000003E1B7F0  Unknown               Unknown  Unknown

This is with a master from a few days ago, but still with CIME5.

rljacob commented 7 years ago

Skybridge has Intel 15.0.1 and Mark reports it passes. Titan has 15.0.2 and Pat reports a fail. Someone should try Blues (15.0.0).

jayeshkrishna commented 7 years ago

I will try ERP_Ld3_P8x4.ne4_ne4.FC5AV1C on blues

wlin7 commented 7 years ago

Noel, I saw this "array index out of bounds" during initialization on cori-knl when doing ne120 simulations, after successful runs using the same executable. Rerunning was OK, so I made no further effort to dig into it. I don't know whether it could be related to the current issue. You may also set "cosp_lite=.true." in user_nl_cam to reduce memory usage if the model is built with COSP.

jayeshkrishna commented 7 years ago

I ran ERP_Ld3_P8x4.ne4_ne4.FC5AV1C on blues with intel 15.0 (the default with --compiler=intel) and it crashed for me. Can someone else try it on blues?

rljacob commented 7 years ago

Note: There's a new test type in cime5 called "REP" which will simply do 2 identical runs and compare results. Try REP_Ld1_P8x4.ne4_ne4.FC5AV1C.
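
A minimal sketch of using it (the TestStatus grep is an assumption; the exact phase names recorded there vary between CIME versions):

    # Run the REP test: it performs two identical runs and compares them
    ./create_test REP_Ld1_P8x4.ne4_ne4.FC5AV1C

    # The run-to-run comparison result is recorded in the test case's TestStatus file
    grep -i compare <test-case-dir>/TestStatus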

wlin7 commented 7 years ago

@jayeshkrishna , for Blues: are you getting an error message like

HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:912): assert (!closed) failed
HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@b511] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event

I have been seeing such failures a lot lately, with model builds that previously worked fine.

jayeshkrishna commented 7 years ago

The error message above (HYD_pmcd...) can occur whenever a process crashes and the process manager is unable to clean up resources (close sockets, etc.). I have seen the issues mentioned in #1206 on blues, though.

ndkeen commented 7 years ago

I re-ran on Cori and got the same problem (array index out of bounds). I too have seen this before, and I think sometimes it was related to using too much memory on the node. This case was asking for 64 MPI tasks on 1 node, so I submitted again asking for 64 MPI tasks across 4 nodes -- this also failed with a different error that was not helpful; all I can tell is that it happened before the previous one. I have run other tests similar to this on Cori without a problem. I will try a debug run.

OK, with a DEBUG=TRUE run, the simulation is progressing. I have actually seen this behaviour before with an ne120 F compset on Cori -- where it runs with DEBUG but not optimized; however, the error message was different.

For the DEBUG run, it ran for 13 ATM steps. I submitted again and, after 6 steps, all values are identical. At what point might they differ? Ah, I did not change env_mach_pes to use more than 1 thread. I am still trying to debug why it stops for me with 1 thread, but I will also run with more.

mt5555 commented 7 years ago

@rljacob thanks for mentioning the REP test.

Edison, ACME master:
FAIL: ERP_Ld3_P8x4.ne4_ne4.FC5AV1C
FAIL: REP_Ld3_P8x4.ne4_ne4.FC5AV1C

worleyph commented 7 years ago

FYI - I tried running v1.0.0-alpha.9-51-g7a17edb (still CIME2, updated most recently on Nov. 19?) with

 -compset FC5AV1C -res ne30_ne30 -compiler intel

and this is not deterministic either. So, this problem is not especially recent.

singhbalwinder commented 7 years ago

Thanks all for running tests to verify that the problem exists on all these platforms and was also present in past versions. I am still debugging the radiation code to figure out what's causing this.

wlin7 commented 7 years ago

Hi @singhbalwinder and all,

An update: FC5AV1C-04P was originally BFB.

Since Balwinder Singh has traced the problem to the radiation code, and the one (probably only) major update to the radiation was the RRTMG fix, I decided to test the hash from before the RRTMG fix was applied. The compset FC5AV1C-04P was created prior to that. The runs on Edison with 4 threads, repeated once, are BFB. So it looks like we can focus on the changes introduced by the RRTMG fix to isolate the problem. Note that with the current master, FC5AV1C-04P is also affected by the RRTMG fix; that is probably why both FC5AV1C-04P and FC5AV1C-04P2 are currently non-BFB.

The hash I used is 9e99c99ef38e477f43f3fbb1d9ed94d15db500d0; the final merge at that point was from David Hall on Oct. 31, 10:07:05. In the current master, the RRTMG fix, from branch kaizhangpnl/atm/bugfix_rrtmg, comes immediately after.

This message was also posted on the confluence page.

singhbalwinder commented 7 years ago

Great! Thanks @wlin7 . I will give that a try on my end.

ndkeen commented 7 years ago

I made some progress working on cori-knl. I found that the error I'm hitting only happens with an OPT build. Furthermore, if I add "-no-vec" to the build flags, I can also get past this error. So this may be something else entirely, but now, at least, I can run. I am trying no threads and then comparing against 4 threads using the latest Intel v17 compiler on cori-knl. Or is it preferred to compare 2 vs 4 threads?

singhbalwinder commented 7 years ago

Thanks Noel. I am running ne16 with 32 procs (8x4, 8 MPI and 4 threads). I just repeat the same simulation and compare atm.log files.

@wlin7 : I just tested the code by commenting out the changes in the two files that were modified by PR #1097. The code was still non-BFB with threading turned on. Can you repeat your test a few more times? I have noticed that I sometimes get BFB results when I expect them to be non-BFB, so I run it multiple times to assure myself.

wlin7 commented 7 years ago

@singhbalwinder , the hash before PR #1097 has been run four times. Global mean stats are BFB. The master at that point was fine. But as you tested, the RRTMG fix as in PR #1097 is not causing the current problem.

I put the diff of the whole CAM source code between the current master and the just-verified BFB version at http://portal.nersc.gov/project/m2136/share/diff-4BFB-check.txt. The first files are from the current master. I haven't found anything suspicious among them.

ndkeen commented 7 years ago

FYI, I have results showing diffs in the "step" lines of the atm.log* files for a couple of different parameter choices: no threads, 2 threads, and 4 threads are all different. That's using a slightly different set of build flags (as the default gives me the array index error noted above). I also tried leaving on "-check bounds", which gets around the runtime stop, but those runs also show diffs between no threads, 2, and 4 threads. So unless I'm doing something wrong, it looks like Intel v17 (intel/17.0.1.132) also sees non-BFB results on cori-knl. For one of the runs, I upped the number of nodes to 4 and MPI tasks to 256. I can summarize my results in more detail if it helps, or is it enough to know that I also see the issue?

I am pushing 3 different directories with the following diffs in the build (to get around the runtime issues): one built with "FFLAGS += -O2 -qno-opt-dynamic-align -no-vec", one with "FFLAGS += -O1 -g", and one with "FFLAGS += -O2 -g -check bounds".

worleyph commented 7 years ago

@ndk, I think that the issue being pursued here is that running the identical job twice with threading will not be BFB. Given this, changing these parameters will (trivially) also not be BFB.

ndkeen commented 7 years ago

Ah, OK. That's weird. I will try that now.

Re-submitting all 3 cases (described above) shows that I get different results.

singhbalwinder commented 7 years ago

Thanks @wlin7 and @ndkeen .

I have been testing the code and found that my tests were not robust enough to catch the problem every time I ran them. I then decided to run and compare the following tests, assuming that they will always invoke threading differently and so reveal this issue (I may need some advice from an OpenMP expert):

SMS_Ln5_P4x8.FC5AV1C-04P and SMS_Ln5_P8x4.FC5AV1C-04P

With these tests, the physics update after the CLUBB call also makes the answers differ. CLUBB is a big code base and I am not very familiar with it, so I took a git bisect approach to find which hash introduced this issue. The bisect identified hash c50a810d9544251d502925db3cdf743044c2f243 on master as the one that introduced it. I have run the hash just before it (f37020e3972dea7109b5644a5d391ca6a4315567) and c50a810d9544251d502925db3cdf743044c2f243 multiple times, and I consistently see that c50a810d9544251d502925db3cdf743044c2f243 is the one that fails; f37020e3972dea7109b5644a5d391ca6a4315567 passes my tests every time.

Now, c50a810d9544251d502925db3cdf743044c2f243 is the hash where we introduced CIME5 into master, and it modifies or adds close to ~2000 files. It would be great if somebody else could verify my findings. I would also appreciate suggestions on which of those ~2000 files I should look into to fix this issue.
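
For anyone repeating this, a rough sketch of the bisect procedure described above (check_bfb.sh is a hypothetical helper script, not something in the repo; it would build, run the threaded case twice, and exit non-zero when the atm.log global stats differ):

    # Bisect between a known-good and a known-bad commit
    # (the bisect above converged on f37020e3 good / c50a810d bad)
    git bisect start
    git bisect bad  <first-known-bad-hash>
    git bisect good <last-known-good-hash>

    # Let git drive the search with the hypothetical BFB check;
    # a non-zero exit marks the commit as bad
    git bisect run ./check_bfb.sh

    # Return to the original checkout when done
    git bisect reset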

amametjanov commented 7 years ago

@singhbalwinder, could you check whether the OpenMP settings are being correctly propagated to the compute nodes? Please add this to case.run before a run:

os.system('printenv >& run_env.txt')

Things to look out for are the values of the OpenMP env vars, in particular OMP_NUM_THREADS and OMP_STACKSIZE.

If absent, they should be set in env_mach_specific.xml:

<environment_variables>
  <env name="OMP_STACKSIZE">256M</env>
</environment_variables>
<environment_variables>
  <env name="OMP_NUM_THREADS">4</env> <!-- or 8-->
</environment_variables>
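
A quick way to check what actually reached the compute nodes after adding the printenv line above (run_env.txt lands in the run directory; the env_mach_pes.xml grep is just for comparing against what the case requested):

    # Values seen by the compute nodes
    grep -E 'OMP_(NUM_THREADS|STACKSIZE)' run_env.txt

    # Threading requested by the case setup (from the case directory)
    grep -i NTHRDS env_mach_pes.xml
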
rljacob commented 7 years ago

The CIME5 upgrade doesn't touch any CAM source files. But you may want to check that the machine/compiler/pe-layouts are the same. See https://acme-climate.atlassian.net/wiki/display/SE/CIME+upgrade+details#CIMEupgradedetails-DebuggingdifferencesbetweenCIME2andCIME5

wlin7 commented 7 years ago

@singhbalwinder , I have been using the version just before the CIME5 merge for the ne120 tests. That is the one in which the non-BFB issue was initially observed. The last update for that hash (855a13aedca04cbb10dd7a51a286dc41e8e11a40) was Dec. 2.

worleyph commented 7 years ago

@rljacob, while CIME5 did not touch any CAM source files, I think it changed how threads are set in the driver? I saw some email/GitHub traffic to that effect. I think it was fixed, and it is probably irrelevant to this anyway, but I just wanted to point out that CIME5 might not be completely innocuous.

mt5555 commented 7 years ago

In particular, driver_threading=.true. is needed by ACME, but I think it is not used by CESM and might be broken. Except @wlin7 showed the problem also exists pre-CIME5.

rljacob commented 7 years ago

DRV_THREADING in env_run.xml should be TRUE for all cases in ACME with CIME5.

mt5555 commented 7 years ago

It is TRUE for all ACME cases I've checked. But since CESM has it false, how confident are you that the driver threading code is actually working?

singhbalwinder commented 7 years ago

@wlin7 : Testing hash 855a13aedca04cbb10dd7a51a286dc41e8e11a40 with my tests above (4x8 vs 8x4) gave BFB results in my 5-time-step runs. I am now trying the ne30 and ne120 grids to see if they reveal this issue with this hash.

rljacob commented 7 years ago

Pretty confident that threading is working. The code is the same, the output in acme.log from seq_comm_setcomm matches what's in env_mach_pes.xml, and we have examples of threading working in other cases/machines/compilers.

wlin7 commented 7 years ago

@singhbalwinder , FYI: the non-BFB behavior with hash 855a13aedca04cbb10dd7a51a286dc41e8e11a40 was only tested for ne120 on cori-knl. Once it appeared that threading was an issue, I switched to the post-CIME5 master to test ne30, which confirmed the problem exists in the current master.

rljacob commented 7 years ago

Oh if you only saw non-BFB with hash 855a13a using ne120 on cori-knl, that could be a different non-BFB problem.

wlin7 commented 7 years ago

@rljacob , hopefully not a different kind. On cori-knl for ne120 with that hash, it was BFB when using all-MPI and non-BFB when threading was on. This behavior is consistent with the non-BFB problem we are dealing with, though the more recent master has been tested mostly at resolutions other than ne120.