Closed: @singhbalwinder closed this issue 7 years ago.
I ran a few tests and identified the following line in cam/src/physics/cam/physpkg.F90 (line 997) as possibly causing this:
!$OMP PARALLEL DO PRIVATE (C, phys_buffer_chunk)
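An illustrative aside (not from the thread): if a variable that should be PRIVATE in a directive like the one above is actually shared, or if threads combine partial sums in a different order from run to run, floating-point non-associativity alone is enough to change the last bits of the global stats. A tiny stand-alone demonstration of that non-associativity:

```shell
# Floating-point addition is not associative: the same three numbers
# summed in two different orders (as two different thread interleavings
# would do) disagree in the last digits.
awk 'BEGIN { printf "(0.1 + 0.2) + 0.3 = %.17g\n", (0.1 + 0.2) + 0.3 }'
awk 'BEGIN { printf "0.1 + (0.2 + 0.3) = %.17g\n", 0.1 + (0.2 + 0.3) }'
```

The two printed values differ only in the trailing digits, which is exactly the kind of last-bit drift that then grows over time steps.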
@amametjanov ran some tests on Mira and found the results to be BFB. This may imply that the non-BFB behavior is only associated with the Intel compiler. I am not sure if @wlin7 has tested it with any other compiler. I have only run tests with the Intel compiler. Following is the email I got from @amametjanov :
The issue is probably isolated to the Intel compiler. I did two runs of
`-compset FC5AV1C-04P2 -res ne30_ne30 -mach mira` and compared atm.log's of the two runs;
values were BFB: e.g.
run1:
nstep, te 46 0.33526714831201005E+10 0.33526842034517279E+10 0.70337333877628250E-03 0.98533311649086012E+05
nstep, te 47 0.33526160259059653E+10 0.33526285871776705E+10 0.69457809221461578E-03 0.98533311029669741E+05
nstep, te 48 0.33525611951309304E+10 0.33525737290326490E+10 0.69306464339303055E-03 0.98533313845669283E+05
run2:
nstep, te 46 0.33526714831201005E+10 0.33526842034517279E+10 0.70337333877628250E-03 0.98533311649086012E+05
nstep, te 47 0.33526160259059653E+10 0.33526285871776705E+10 0.69457809221461578E-03 0.98533311029669741E+05
nstep, te 48 0.33525611951309304E+10 0.33525737290326490E+10 0.69306464339303055E-03 0.98533313845669283E+05
Cprnc of cam.r.0001-01-02-00000.nc and cam.rh0.0001-01-02-00000.nc also shows the two runs to be
identical. Both runs used 1024 nodes, with 675 of them running the ATM at
4 MPI tasks x 16 OMP threads per node.
So threaded F-cases are still BFB. :)
@singhbalwinder , @amametjanov , for the record, my non-BFB runs were also using the Intel compiler.
FYI - I'm comparing pgi and intel on Titan at the moment. Two runs with the Intel compiler were not BFB. PGI job will be submitted next.
I'll add some results here from Intel (v15) compiler on the SNL institutional cluster skybridge. I have not been able to reproduce the problem on this machine.
Note that @singhbalwinder isolated the problem to the physics parameterizations, so that would suggest this is not due to the initialization, and thus it should be detected by ERP tests (which compare a restart run to a full run).
ERP_Ld3_P8x4.ne4_ne4.FC5AV1C PASS
ERP_Ld3_P32x4.ne16_ne16.FC5AV1C PASS
ERP_P8x4.ne4_ne4.FC5AV1C PASS
ERP_Ld3_P32x4.ne16_ne16.FC5AV1C-04P2 PASS
SMS_Ld3_P32x4.ne16_ne16.FC5AV1C (compare against baseline - PASS)
@mt5555 , FYI - physics parameterizations are called during initialization.
On Titan, PGI was B4B for two identical runs, while Intel was not, using 675x4 for the ATM and 676x4 for everything else with
-compset FC5AV1C-04P2 -res ne30_ne30
Note that setting BFBFLAG to TRUE did not fix the problem.
Thanks all for running these tests. So far it seems the issue exists only when running with the Intel compiler, on all machines but Skybridge. I am using intel/16.0.1.150, so it might be specific to certain versions of the compiler.
In my tests, the problem appears to be somewhere in the radiation code. If I comment out the physics_update call after radiation, I get BFB answers for my five-time-step runs. I am currently looking into the radiation code.
Hi all,
Per Shaocheng's request, I have created a Confluence page to track this problem. The problem and the main findings so far are copied there. Since not everybody can see the GitHub notices, the main findings/updates here will be summarized and posted on Confluence until the problem is resolved. Thanks.
I can test on Cori with Intel 17. Is it just one of the tests? I see: SMS_Ln5.ne16_ne16.FC5AV1C-04P2 I can run that with two diff thread PE layouts, but then how to determine if they are BFB or not?
you should be able to run an ERP test which will compare a restart run with a full run.
To use SMS test, you would do the following:
create a baseline:
./create_test -g SMS_Ln5.ne16_ne16.FC5AV1C-04P2
make a new run, and compare against the baseline: ./create_test -c SMS_Ln5.ne16_ne16.FC5AV1C-04P2
I believe @wlin7 isn't using the test system, but instead just making two runs and then diffing the atm.log files.
Thanks @wlin7 for setting up the confluence page.
@ndkeen : To reproduce this bug, do the following:
Run SMS.ne16_ne16.FC5AV1C-04P2 using a PE layout which has more than one thread. This run will generate an atm.log file in the run directory. After this run finishes, go to the case directory of this test and issue ./case.submit
again. You will get a new atm.log file from this second run. Compare the two files and look for differences in the global stats reported in them. The global stats look like the following:
nstep, te 2 0.33448292351395354E+10 0.33448008133551283E+10 -0.15716762386893993E-02 0.98527804155681632E+05
nstep, te 3 0.33449462574350176E+10 0.33449131757767444E+10 -0.18293639217668925E-02 0.98527556425431263E+05
nstep, te 4 0.33450519447836037E+10 0.33450230985236235E+10 -0.15951560136756191E-02 0.98527371879217841E+05
nstep, te 5 0.33451305205696392E+10 0.33451101504293952E+10 -0.11264401849883554E-02 0.98527266615589455E+05
nstep, te 6 0.33452377748902602E+10 0.33452111422874856E+10 -0.14727470926464590E-02 0.98527165389324291E+05
If you see differences in the numbers after "nstep, te" between the two runs, the runs are non-BFB.
And in my runs the difference first shows up on nstep 3, so you do not need a long run to see this. I've been using 1 day, though 5 step is probably enough.
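The comparison step described above can be scripted; a minimal sketch (check_bfb is a made-up helper name, and the file names are examples — substitute the timestamped atm.log files from your two runs):

```shell
# check_bfb: extract the "nstep, te" global-stat lines from two
# atm.log files and diff them; print BFB or non-BFB accordingly.
check_bfb() {
    s1=$(mktemp); s2=$(mktemp)
    grep 'nstep, te' "$1" > "$s1"
    grep 'nstep, te' "$2" > "$s2"
    if diff -q "$s1" "$s2" > /dev/null; then
        echo "BFB"
    else
        echo "non-BFB"
    fi
    rm -f "$s1" "$s2"
}
```

Usage: check_bfb run1/atm.log run2/atm.log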
I ran this test on Edison, with acme master Jan 6:
ERP_Ld3_P8x4.ne4_ne4.FC5AV1C FAIL
So this confirms that the problem does show up in an ERP test and at very low resolution (ne4).
This identical test will PASS on skybridge. These are both xeon systems, the main difference being intel compiler versions.
I tried FC5AV1C-04P (not 04P2) and the differences showed up with nstep 1. I went back and looked again, and the same was true for FC5AV1C-04P2 (not nstep 3). Sorry for the misinformation.
@mt5555 , the Titan results are from using intel/15.0.2.164 .
Ack, I got:
35: forrtl: severe (154): array index out of bounds
35: Image PC Routine Line Source
35: acme.exe 0000000003DFB9E1 Unknown Unknown Unknown
35: acme.exe (deleted 0000000003DF9B1B Unknown Unknown Unknown
35: acme.exe 0000000003DA1634 Unknown Unknown Unknown
35: acme.exe (deleted 0000000003DA1446 Unknown Unknown Unknown
35: acme.exe 0000000003D1FE86 Unknown Unknown Unknown
35: acme.exe (deleted 0000000003D2BA13 Unknown Unknown Unknown
35: acme.exe 00000000039D8D20 Unknown Unknown Unknown
35: acme.exe (deleted 00000000010F3451 Unknown Unknown Unknown
35: acme.exe 00000000010F9CC2 Unknown Unknown Unknown
35: acme.exe (deleted 0000000000E4971D Unknown Unknown Unknown
35: acme.exe 00000000004FB6D9 cam_comp_mp_cam_r 240 cam_comp.F90
35: acme.exe 00000000004EAC4F atm_comp_mct_mp_a 341 atm_comp_mct.F90
35: acme.exe 000000000042705E component_mod_mp_ 227 component_mod.F90
35: acme.exe 000000000041C577 cesm_comp_mod_mp_ 1926 cesm_comp_mod.F90
35: acme.exe 000000000042408D MAIN__ 62 cesm_driver.F90
35: acme.exe (deleted 000000000040A8DE Unknown Unknown Unknown
35: acme.exe 0000000003E1B7F0 Unknown Unknown Unknown
This is with a master from a few days ago, but still with cime5.
Skybridge has Intel 15.0.1 and Mark reports it passes. Titan has 15.0.2 and Pat reports a fail. Someone should try Blues (15.0.0).
I will try ERP_Ld3_P8x4.ne4_ne4.FC5AV1C on blues
Noel, I saw this "array index out of bounds" during initialization on cori-knl when doing ne120 simulations, after successful runs using the same executable. Rerun was ok, so no further effort to dig into it. Don't know if it could be related to the current issue. You may also set "cosp_lite=.true." in user_nl_cam to reduce memory usage if the model is built with COSP.
I ran ERP_Ld3_P8x4.ne4_ne4.FC5AV1C on blues with intel 15.0 (the default with --compiler=intel) and it crashed for me. Can someone else try it on blues?
Note: There's a new test type in cime5 called "REP" which will simply do 2 identical runs and compare results. Try REP_Ld1_P8x4.ne4_ne4.FC5AV1C.
@jayeshkrishna , for the blues. are you getting error message like
HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:912): assert (!closed) failed
HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@b511] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
I have been seeing such failure lately, a lot, with the model build previously working fine.
The error message above (HYD_pmcd...) can occur whenever a process crashes and the process manager is unable to clean up resources (close sockets, etc.). I have seen issues mentioned in #1206 on blues though.
I re-ran on cori, and got the same problem (array index out of bounds). I have seen this before too, and I think sometimes it was related to using too much memory on the node. This case was asking for 64 MPIs on 1 node, so I submitted again asking for 64 MPIs across 4 nodes -- this also failed, with a different but unhelpful error. All I can tell is that it happened earlier than the previous failure. I have run other similar tests on cori without a problem. I will try a debug run.
OK, with DEBUG=TRUE run, the simulation is progressing. I have actually seen this behaviour before with a ne120 F compset on Cori -- where it runs DEBUG, but not optimized, however the error message is different.
For the DEBUG run, it ran for 13 ATM steps. I submitted again, and after 6 steps all values are identical. At what point might they become unequal? Ah, I did not change env_mach_pes to use more than 1 thread. I am still trying to debug why it stops for me with 1 thread, but will also run with more.
@rljacob thanks for mentioning the REP test.
Edison, acme master: FAIL: ERP_Ld3_P8x4.ne4_ne4.FC5AV1C FAIL: REP_Ld3_P8x4.ne4_ne4.FC5AV1C
FYI - I tried running v1.0.0-alpha.9-51-g7a17edb (still CIME2, updated most recently on Nov. 19?) with
-compset FC5AV1C -res ne30_ne30 -compiler intel
and this is not deterministic either. So, this problem is not especially recent.
Thanks all for running tests to verify that the problem exists on all these platforms and that it was present in past versions as well. I am still debugging the radiation code to figure out what's causing this.
Hi @singhbalwinder and all,
An update: FC5AV1C-04P was originally BFB.
Since Balwinder Singh has traced the problem to the radiation code, and the one (probably only) major update to the radiation was the rrtmg fix, I decided to test the hash from just before the rrtmg fix was applied. The compset FC5AV1C-04P was created prior to that. The runs on edison with 4 threads, repeated once, are BFB. So it looks like we can focus on the changes introduced by the rrtmg fix to isolate the problem. Note that with the current master, FC5AV1C-04P is also affected by the rrtmg fix. That is probably why both FC5AV1C-04P and FC5AV1C-04P2 are currently non-BFB.
The hash I used is 9e99c99ef38e477f43f3fbb1d9ed94d15db500d0, the final merge at that point was from David Hall on Oct. 31, 10:07:05. In the current master, the rrtmg fix, from branch kaizhangpnl/atm/bugfix_rrtmg, was immediately after.
This message was also posted on the confluence page.
Great! Thanks @wlin7 . I will give that a try on my end.
I made some progress working on cori-knl. I found that the error I'm hitting only happens with an OPT build. Furthermore, if I add "-no-vec" to the build flags, I can also get past this error. So this may be something else entirely, but now, at least, I can run. I am trying no threads and then comparing to 4 threads using the latest Intel v17 compiler on cori-knl. Or is it preferred to compare 2 vs 4 threads?
Thanks Noel. I am running ne16 with 32 procs (8x4, 8 MPI and 4 threads). I just repeat the same simulation and compare atm.log files.
@wlin7 : I just tested the code by commenting out the changes in the two files modified by PR #1097. The code was still non-BFB with threading turned on. Can you repeat your test a few more times? I have noticed that sometimes I get BFB results when I expect them to be non-BFB, so I run it multiple times to assure myself.
@singhbalwinder , the hash before PR #1097 has been run four times. Global mean stats are BFB. The master at that point was fine. But as you tested, the rrtmg fix in PR #1097 is not causing the current problem.
I put a diff of the entire CAM source code between the current master and the just-verified BFB version at http://portal.nersc.gov/project/m2136/share/diff-4BFB-check.txt. The first file of each pair is from the current master. I haven't found anything suspicious among them.
FYI, I have results showing diffs in the "step" lines of the atm.log* files for a couple of different parameter changes: no threads, 2 threads, and 4 threads are all different. That's using a slightly different set of build flags (as the default gives me an array index error, as noted above). I also tried leaving on "-check bounds", which gets around the runtime stop, but those runs also show diffs between no threads, 2, and 4 threads. So unless I'm doing something wrong, it looks like Intel v17 (intel/17.0.1.132) also sees non-BFB on cori-knl. For one of the runs, I upped the number of nodes to 4 and MPIs to 256. I can summarize my results in more detail if it helps, or is it enough to know that I also see the issue?
I am pushing 3 different directories with the following diffs in build (to get around runtime issues): build with "FFLAGS += -O2 -qno-opt-dynamic-align -no-vec", " FFLAGS += -O1 -g", and "FFLAGS += -O2 -g -check bounds".
@ndk, I think that the issue being pursued here is that running the identical job twice with threading will not be BFB. Given this, changing these parameters will (trivially) also not be BFB.
Ah, OK. That's weird. I will try that now.
Re-submitting all 3 cases (described above) show that I get different results.
Thanks @wlin7 and @ndkeen .
I have been testing the code and found that my tests were not robust enough to catch the problem every time I ran them. I then decided to run and compare the following tests, assuming that they will always invoke threading differently and thereby reveal this issue (I may need some advice from an OpenMP expert here):
SMS_Ln5_P4x8.FC5AV1C-04P and SMS_Ln5_P8x4.FC5AV1C-04P
With these tests, the physics update after the clubb call also makes the answers differ. Clubb is a big code base and I am not very familiar with it, so I took a git bisect approach to see which hash introduced this issue. This approach found hash c50a810d9544251d502925db3cdf743044c2f243 on master as the one which introduced this issue. I have run the hash just before it (f37020e3972dea7109b5644a5d391ca6a4315567) and c50a810d9544251d502925db3cdf743044c2f243 multiple times, and I can consistently see that c50a810d9544251d502925db3cdf743044c2f243 is the one which fails. f37020e3972dea7109b5644a5d391ca6a4315567 passes my tests every time.
Now, c50a810d9544251d502925db3cdf743044c2f243 is the hash where we introduced CIME5 into master, and it modifies/adds close to 2000 files. It would be great if somebody else could verify my findings. I would also appreciate suggestions on which of those ~2000 files I should look into for fixing this issue.
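For anyone unfamiliar with the approach above, `git bisect run` automates the search. Here is a self-contained toy demonstration (the throwaway repo, file names, and the grep stand-in for the real threaded-model-run test are all made up):

```shell
# Toy git-bisect demo: build a 4-commit repo where the 3rd commit
# introduces a "bug", then let `git bisect run` find it automatically.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect-demo
for i in 1 2; do
    echo "ok $i" > result.txt
    git add result.txt && git commit -qm "good $i"
done
for i in 3 4; do
    echo "non-BFB $i" > result.txt
    git add result.txt && git commit -qm "bad $i"
done
# Endpoints: HEAD is known bad, HEAD~3 is known good.
git bisect start HEAD HEAD~3
# The run script exits 0 for a good commit, nonzero for a bad one;
# in the real case this would be the threaded run plus a log diff.
git bisect run sh -c '! grep -q non-BFB result.txt'
first_bad=$(git show -s --format=%s refs/bisect/bad)
echo "first bad commit: $first_bad"
git bisect reset
```

In the real search the run script would build, run the job twice, and diff the atm.log global stats, returning nonzero when they differ.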
@singhbalwinder, could you check if the OpenMP settings are being propagated correctly to the compute nodes? Please add this to case.run before a run:
os.system('printenv >& run_env.txt')
Things to look out for are values for env-vars:
If absent, they should be set in env_mach_specific.xml:
<environment_variables>
<env name="OMP_STACKSIZE">256M</env>
</environment_variables>
<environment_variables>
<env name="OMP_NUM_THREADS">4</env> <!-- or 8-->
</environment_variables>
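Once the run has produced run_env.txt (from the printenv line above), one quick way to check for the settings is a sketch like this (the file name comes from the snippet above; the error message is made up):

```shell
# Check whether the OpenMP settings made it into the job environment
# captured in run_env.txt; complain if they are missing.
grep -E '^OMP_(NUM_THREADS|STACKSIZE)=' run_env.txt \
    || echo "OMP_* variables missing: set them in env_mach_specific.xml"
```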
The CIME5 upgrade doesn't touch any CAM source files. But you may want to check that the machine/compiler/pe-layouts are the same. See https://acme-climate.atlassian.net/wiki/display/SE/CIME+upgrade+details#CIMEupgradedetails-DebuggingdifferencesbetweenCIME2andCIME5
@singhbalwinder , I have been using the version just before the CIME5 merge for ne120 tests. That is the one the non-BFB issue was initially observed. The last update for that hash ( 855a13aedca04cbb10dd7a51a286dc41e8e11a40) was Dec. 2.
@rljacob, while CIME5 did not touch any CAM source files, I think it touched how threads were set in the driver? I saw some e-mail/github traffic to that effect. I think that was fixed, and it is probably irrelevant to this anyway, but I just wanted to point out that CIME5 might not be completely innocuous.
In particular, driver_threading=.true. is needed by ACME, but I think it is not used by CESM and might be broken. Except that @wlin7 showed the problem also exists pre-CIME5.
DRV_THREADING in env_run.xml should be TRUE for all cases in ACME with CIME5.
It is TRUE for all ACME cases I've checked. But since CESM has it false, how confident are you that the driver threading code is actually working?
@wlin7 : Testing hash 855a13aedca04cbb10dd7a51a286dc41e8e11a40 with my tests above (4x8 vs 8x4) gave BFB results in my 5-time-step runs. I am now trying the ne30 and ne120 grids to see if they reveal this issue with this hash.
Pretty confident that threading is working. The code is the same, the output in acme.log from seq_comm_setcomm matches what's in env_mach_pes.xml, and we have examples of threading working in other cases/machines/compilers.
@singhbalwinder , FYI: The non-BFB with the hash 855a13aedca04cbb10dd7a51a286dc41e8e11a40 was only tested for ne120 on cori-knl. Once it appeared threading was an issue, I switched to post-cime5 master to test ne30 , which confirmed the problem existed in the current master.
Oh, if you only saw non-BFB with hash 855a13a using ne120 on cori-knl, that could be a different non-BFB problem.
@rljacob , hopefully not a different kind. On cori-knl for ne120 with that hash, it was BFB when using all MPI and non-BFB when threading was on. This behavior is consistent with the non-BFB problem we are dealing with, though more recent master has been tested, mostly not for ne120.
@wlin7 recently identified that the model is non-BFB when run with threading turned on. He has reproduced this problem on Edison, Anvil and Cori with the ne30 and ne120 grids. I (Balwinder) was also able to reproduce this on Eos with the ne16 and ne30 grids. To reproduce it, I ran a smoke test on Eos:
SMS_Ln5.ne16_ne16.FC5AV1C-04P2
Resubmitting this test (./case.submit) produces non-BFB results when compared against the initial run.