E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Fails with ne4 CIME runs on Weaver #1551

Open tcclevenger opened 2 years ago

tcclevenger commented 2 years ago

When running this test case:

./create_test SMS_D_Ln10_P1x1.ne4_ne4.F2000SCREAMv1.weaver_gnugpu

we get a failure during the first time step in ELM:

[weaver4:25391:0:25391] Caught signal 8 (Floating point exception: floating-point divide by zero)

/home/tccleve/E3SM/SCREAM/scream/components/elm/src/biogeophys/SnowSnicarMod.F90: [ __snowsnicarmod_MOD_snicar_ad_rt() ]
      ...
     2186              ! Loop over snow spectral bands
     2187 
     2188              exp_min = exp(-argmax)
==>  2189              do bnd_idx = 1,numrad_snw
     2190 
     2191                ! note that we can remove flg_dover since this algorithm is
     2192                ! stable for mu_not > 0.01

==== backtrace (tid:  25391) ====
 0 0x0000000010bd75c8 __snowsnicarmod_MOD_snicar_ad_rt()  /home/tccleve/E3SM/SCREAM/scream/components/elm/src/biogeophys/SnowSnicarMod.F90:2189
 1 0x0000000010dbd138 __surfacealbedomod_MOD_surfacealbedo()  /home/tccleve/E3SM/SCREAM/scream/components/elm/src/biogeophys/SurfaceAlbedoMod.F90:637
 2 0x00000000102c88d0 __elm_driver_MOD_elm_drv()  /home/tccleve/E3SM/SCREAM/scream/components/elm/src/main/elm_driver.F90:1323
 3 0x000000001028474c __lnd_comp_mct_MOD_lnd_run_mct()  /home/tccleve/E3SM/SCREAM/scream/components/elm/src/cpl/lnd_comp_mct.F90:512
 4 0x000000001005d030 __component_mod_MOD_component_run()  /home/tccleve/E3SM/SCREAM/scream/driver-mct/main/component_mod.F90:728
 5 0x00000000100379b4 __cime_comp_mod_MOD_cime_run()  /home/tccleve/E3SM/SCREAM/scream/driver-mct/main/cime_comp_mod.F90:2889
 6 0x0000000010059eb4 MAIN__()  /home/tccleve/E3SM/SCREAM/scream/driver-mct/main/cime_driver.F90:153
 7 0x0000000010059f64 main()  /home/tccleve/E3SM/SCREAM/scream/driver-mct/main/cime_driver.F90:23
 8 0x0000000000025100 generic_start_main.isra.0()  libc-start.c:0
 9 0x00000000000252f4 __libc_start_main()  ???:0
=================================

This is not a generic GPU failure: we run with no such issues on Summit, which also has V100 GPUs, and on Perlmutter. It seems to be Weaver-specific.
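
For anyone not used to CIME test names, here is a rough sketch of how the name passed to `create_test` above breaks down: test type and modifiers, then grid, compset, and machine_compiler, separated by dots. The helper `parse_cime_test_name` and its dictionary keys are hypothetical, written only for illustration; this is not CIME's own parser, and it assumes none of the dotted fields contains a dot itself.

```python
def parse_cime_test_name(name):
    """Split a CIME-style test name into labelled parts.

    Illustrative helper only (not part of CIME). Assumes the layout
    TESTTYPE[_MODIFIERS].GRID.COMPSET[.MACHINE_COMPILER].
    """
    fields = name.split(".")
    if len(fields) < 3:
        raise ValueError(f"expected at least TEST.GRID.COMPSET, got {name!r}")

    # First field: test type followed by optional underscore-separated modifiers.
    test_type, *modifiers = fields[0].split("_")

    # Optional fourth field: machine and compiler joined by the first underscore.
    machine = compiler = None
    if len(fields) > 3:
        machine, _, compiler = fields[3].partition("_")

    return {
        "test_type": test_type,   # e.g. SMS (smoke test)
        "modifiers": modifiers,   # e.g. D (debug build), Ln10 (10 steps), P1x1 (1 task x 1 thread)
        "grid": fields[1],
        "compset": fields[2],
        "machine": machine,
        "compiler": compiler,
    }


if __name__ == "__main__":
    info = parse_cime_test_name(
        "SMS_D_Ln10_P1x1.ne4_ne4.F2000SCREAMv1.weaver_gnugpu"
    )
    for key, value in info.items():
        print(f"{key}: {value}")
```

So the failing case is a debug-build smoke test, 10 steps long, on one task with one thread, using the ne4 grid and the `gnugpu` compiler setting on Weaver.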

bartgol commented 2 years ago

Is this current master? I know there were some ELM issues that were supposed to be fixed by the last upstream merge. @AaronDonahue might have insight on ELM failures.

tcclevenger commented 2 years ago

Yeah, I've talked to both @ndkeen and @PeterCaldwell. The new fixes (which were in the upstream merge) did not take care of this error.