E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Floating invalid in CLUBB with ne16/ne30/ne120 F case in DEBUG mode with 1 thread on Cori #3142

Closed ndkeen closed 4 years ago

ndkeen commented 5 years ago

With master as of August 14th, I get an error running an F case in DEBUG mode. It happens with the default compiler, and I get the same error with intel19. It seems to happen in ATM init.

SMS_D_PT_PMx1_Ld1.ne120_ne120.FC5AV1C-H01A.cori-knl_intel
SMS_D_PT_PMx1_Ld1.ne120_ne120.FC5AV1C-H01A.cori-knl_intel19

I now see the same error with ne30 and ne16:

SMS_D_PT_PMx1_Ld1.ne30_ne30.FC5AV1C-L
SMS_D_PT_PMx1_Ld1.ne16_ne16.FC5AV1C-L
 5376: forrtl: error (65): floating invalid
 5376: Image              PC                Routine            Line        Source
 5376: e3sm.exe           000000002C02B614  Unknown               Unknown  Unknown
 5376: e3sm.exe           000000002B8DA190  Unknown               Unknown  Unknown
 5376: e3sm.exe           0000000021D8E269  clubb_intr_mp_clu        1740  clubb_intr.F90
 5376: e3sm.exe           0000000022293306  physpkg_mp_tphysb        2539  physpkg.F90
 5376: e3sm.exe           0000000022267744  physpkg_mp_phys_r        1047  physpkg.F90
 5376: e3sm.exe           0000000020447DEB  cam_comp_mp_cam_r         250  cam_comp.F90
 5376: e3sm.exe           000000002041452A  atm_comp_mct_mp_a         396  atm_comp_mct.F90
 5376: e3sm.exe           00000000200535E2  component_mod_mp_         257  component_mod.F90
 5376: e3sm.exe           00000000200230A1  cime_comp_mod_mp_        2187  cime_comp_mod.F90
 5376: e3sm.exe           000000002004A1A5  MAIN__                    197  cime_driver.F90
 5376: e3sm.exe           0000000020001B52  Unknown               Unknown  Unknown
 5376: e3sm.exe           000000002C10EF8F  Unknown               Unknown  Unknown
 5376: e3sm.exe           0000000020001A3A  Unknown               Unknown  Unknown

/global/cscratch1/sd/ndk/acme_scratch/cori-knl/m08-aug14/SMS_D_PT_PMx1_Ld1.ne120_ne120.FC5AV1C-H01A.cori-knl_intel.20190818_123120_vikm6n

/global/cscratch1/sd/ndk/acme_scratch/cori-knl/m08-aug14/SMS_D_PT_PMx1_Ld1.ne120_ne120.FC5AV1C-H01A.cori-knl_intel19.20190818_123147_ibg1wo
ndkeen commented 5 years ago

OK, this looks like one of the issues I was tracking in https://github.com/E3SM-Project/E3SM/issues/3061

However, the above is our default F case.

Note that with threading, which is how it might run by default, I get a different, earlier error with DEBUG=TRUE that is still unsolved (https://github.com/E3SM-Project/E3SM/issues/2131).

ndkeen commented 5 years ago

F-cases using ne4 and ne11 do not fail in the same way.

ndkeen commented 5 years ago

Interesting that the threaded DEBUG case does not stop there. As a test, I added an initialization to zero here:

cam/src/physics/cam/clubb_intr.F90

    if (is_first_step()) then
...
        call pbuf_set_field(pbuf2d, radf_idx,    0.0_r8)
+       call pbuf_set_field(pbuf2d, qrl_idx,    0.0_r8) ! ndk

With that change, the problem goes away. If this is the proper place to initialize qrl, I see other arrays that may need to be set here as well.
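
To illustrate the pattern being tested (give a field a defined value on the first step, before anything reads it), here is a minimal standalone sketch. It does not use the real pbuf API: the array `qrl`, the sizes `ncol` and `nlev`, and the flag `first_step` are hypothetical stand-ins, and the NaN fill only mimics memory that was never assigned.

```fortran
program init_on_first_step
  use, intrinsic :: iso_fortran_env, only: r8 => real64
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, ieee_is_nan
  implicit none

  integer, parameter :: ncol = 4, nlev = 3   ! hypothetical sizes
  real(r8) :: qrl(ncol, nlev)                ! stand-in for the pbuf field
  logical  :: first_step

  ! Mimic a field that was allocated but never assigned: fill it with NaN.
  qrl = ieee_value(0.0_r8, ieee_quiet_nan)
  first_step = .true.

  ! Stand-in for the is_first_step() block: set the field to zero
  ! before any later routine reads it.
  if (first_step) then
     qrl = 0.0_r8
  end if

  ! Check whether any undefined (NaN) values survived the initialization.
  if (any(ieee_is_nan(qrl))) then
     print *, 'qrl still contains NaN'
  else
     print *, 'qrl initialized, sum =', sum(qrl)
  end if
end program init_on_first_step
```

Removing the assignment inside the `if (first_step)` block leaves the NaN values in place, which is the kind of undefined input that a DEBUG build with floating-point trapping can report as a floating invalid once the values are used in arithmetic.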

ndkeen commented 5 years ago

Discussing this with @singhbalwinder on Slack.

ndkeen commented 5 years ago

@vlarson suggested we "set qrl to zero after it is added to pbuf". If that's not what I did above, then I may need help doing this. If it is, I can make a PR.

rljacob commented 5 years ago

@vlarson can someone be assigned to work on this?

rljacob commented 5 years ago

@vlarson Who should work on this?

vlarson commented 4 years ago

@singhbalwinder, Noel and I talked, and it seems OK to me to initialize qrl to zero in clubb_intr, as suggested by Noel. What do you think? How do we make this happen?