E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

SMS_D.ne120_ne120.FC5AV1C-H01A.cori-knl_intel fail with Intel 18 #2007

Closed ndkeen closed 6 years ago

ndkeen commented 6 years ago

I have a branch, ndk/machinefiles/cori-default-intel18, in which I want to change the default compiler on cori-knl from v17 to v18. All other tests I've tried pass, but I hit a problem with the high-res F case. The following runs use the (current) default of 675 nodes. I ran with DEBUG=TRUE and hit this error:

36890: forrtl: severe (153): allocatable array or pointer is not allocated
36890: Image              PC                Routine            Line        Source             
36890: acme.exe           000000000BC9CA1C  Unknown               Unknown  Unknown
36890: acme.exe           00000000019346B2  rrtmg_state_mp_rr         236 rrtmg_state.F90
36890: acme.exe           0000000000E287BD  radiation_mp_radi        1286  radiation.F90
36890: acme.exe           0000000000D46197  physpkg_mp_tphysb        2673  physpkg.F90
36890: acme.exe           0000000000D12BCA  physpkg_mp_phys_r        1027  physpkg.F90
36890: acme.exe           000000000A2BC593  Unknown               Unknown  Unknown
36890: acme.exe           000000000A2742A0  Unknown               Unknown  Unknown
36890: acme.exe           000000000A273535  Unknown               Unknown  Unknown
36890: acme.exe           000000000A2BC9C9  Unknown               Unknown  Unknown
36890: acme.exe           000000000B653094  Unknown               Unknown  Unknown
36890: acme.exe           000000000BE1B2E9  Unknown               Unknown  Unknown

case: /global/cscratch1/sd/ndk/acme_scratch/cori-knl/mfdflt18/SMS_D.ne120_ne120.FC5AV1C-H01A.cori-knl_intel.20180104_111658_xe7sd9

The non-DEBUG attempt failed differently:

07292: forrtl: severe (154): array index out of bounds
07292: Image              PC                Routine            Line        Source             
07292: acme.exe           0000000004893B9E  Unknown               Unknown  Unknown
07292: acme.exe           0000000004254600  Unknown               Unknown  Unknown
07292: acme.exe           0000000002F801F6  m_attrvect_mp_rco        2614  m_AttrVect.F90
07292: acme.exe           0000000003EAC243  Unknown               Unknown  Unknown
07292: acme.exe           0000000003E63F50  Unknown               Unknown  Unknown
07292: acme.exe           0000000003E631E5  Unknown               Unknown  Unknown
07292: acme.exe           0000000003EAC679  Unknown               Unknown  Unknown
07292: acme.exe           00000000042511B4  Unknown               Unknown  Unknown
07292: acme.exe           0000000004A266B9  Unknown               Unknown  Unknown

/global/cscratch1/sd/ndk/acme_scratch/cori-knl/mfdflt18/SMS.ne120_ne120.FC5AV1C-H01A.cori-knl_intel.20180103_171123_ftjxxo

It's possible that Intel 18 (along with minor flag changes) is the issue, but I think it's more likely the compiler has found something others missed.

I realize this isn't in master yet, but wanted to record info and hopefully someone might know more. I've already submitted other runs using fewer nodes (and one without threads).

OK, apparently running the same (non-DEBUG) test again passed: SMS.ne120_ne120.FC5AV1C-H01A.cori-knl_intel. The error above in m_AttrVect.F90 was inside an OpenMP loop, so it's still possible there's an issue. Passing test: SMS.ne120_ne120.FC5AV1C-H01A.cori-knl_intel.20180104_110519_m4ibyt

And I ran with 1 thread without issue: SMS_PMx1.ne120_ne120.FC5AV1C-H01A.cori-knl_intel

A 338-node F-case also ran OK: SMS.ne120_ne120.mfdflt18.n0338p21632t64x2

So perhaps we can ignore the non-DEBUG error, and I will try to recreate the DEBUG=TRUE issue with the smallest possible test. I haven't looked at the code for that one yet.

ndkeen commented 6 years ago

I can repeat the failure with only 85 nodes using SMS_D_P5440.ne120_ne120.FC5AV1C-H01A. Looking at the code, it seems fine. The failure is at the last deallocate for the rstate object, after it has deallocated all of its members. I will try a few things.

  subroutine rrtmg_state_destroy(rstate)

    implicit none

    type(rrtmg_state_t), pointer   :: rstate

    deallocate(rstate%h2ovmr)
    deallocate(rstate%o3vmr)
    deallocate(rstate%co2vmr)
    deallocate(rstate%ch4vmr)
    deallocate(rstate%o2vmr)
    deallocate(rstate%n2ovmr)
    deallocate(rstate%cfc11vmr)
    deallocate(rstate%cfc12vmr)
    deallocate(rstate%cfc22vmr)
    deallocate(rstate%ccl4vmr)

    deallocate(rstate%pmidmb)
    deallocate(rstate%pintmb)
    deallocate(rstate%tlay)
    deallocate(rstate%tlev)

    deallocate( rstate ) !<-- this line
    nullify(rstate)

  endsubroutine rrtmg_state_destroy
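
For anyone trying to narrow this down, here is a minimal, standalone sketch of a guarded version of the same teardown pattern. It is not the E3SM code: `demo_state_mod`, `demo_state_t`, `demo_state_create`, and `demo_state_destroy` are hypothetical names, the type has only two members, and the members are assumed to be allocatable arrays. The sketch checks `allocated`/`associated` before each `deallocate` and passes `stat=` on the final one, so a failure is reported through `ierr` instead of aborting with a runtime error. It would not fix a compiler problem; it only makes the failing step easier to observe.

```fortran
! Hypothetical stand-in for the rrtmg_state pattern above.
! All names, member types, and shapes here are assumptions for illustration.
module demo_state_mod
  implicit none
  private
  public :: demo_state_t, demo_state_create, demo_state_destroy

  type demo_state_t
     real, allocatable :: h2ovmr(:,:)
     real, allocatable :: tlay(:,:)
  end type demo_state_t

contains

  ! Allocate the object and its members, returning a pointer to it.
  function demo_state_create(ncol, nlev) result(rstate)
    integer, intent(in) :: ncol, nlev
    type(demo_state_t), pointer :: rstate

    allocate(rstate)
    allocate(rstate%h2ovmr(ncol, nlev))
    allocate(rstate%tlay(ncol, nlev))
  end function demo_state_create

  ! Guarded teardown: safe to call on a disassociated (nullified) pointer.
  subroutine demo_state_destroy(rstate, ierr)
    type(demo_state_t), pointer :: rstate
    integer, intent(out)        :: ierr

    ierr = 0
    if (.not. associated(rstate)) return

    ! Only deallocate members that are actually allocated.
    if (allocated(rstate%h2ovmr)) deallocate(rstate%h2ovmr)
    if (allocated(rstate%tlay))   deallocate(rstate%tlay)

    ! stat= turns a failed deallocate into a nonzero ierr
    ! instead of an immediate runtime abort.
    deallocate(rstate, stat=ierr)
    if (ierr == 0) nullify(rstate)
  end subroutine demo_state_destroy

end module demo_state_mod

program demo
  use demo_state_mod
  implicit none
  type(demo_state_t), pointer :: rstate
  integer :: ierr

  rstate => demo_state_create(4, 3)

  call demo_state_destroy(rstate, ierr)
  print *, 'first destroy:  ierr =', ierr, ' associated =', associated(rstate)

  ! A second call returns early because the pointer is now disassociated.
  call demo_state_destroy(rstate, ierr)
  print *, 'second destroy: ierr =', ierr, ' associated =', associated(rstate)
end program demo
```

If the members really are allocatable components (as assumed here), Fortran 2003 already deallocates them automatically when the parent object is deallocated, so the explicit member deallocates are redundant but harmless. That would not hold if the members were pointers.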

ndkeen commented 6 years ago

OK, the above was using the "default" version 18 on the machine (intel/18.0.0.128), but there is one higher version: intel/18.0.1.163. With that version I don't see any issues. All other tests pass as well, so I will close this issue.

susburrows commented 4 years ago

I have been trying to run maint-1.0 with an F20TRC5-CMIP6 low-res case on cori-knl and have been stymied by intermittent failures that have symptoms identical to those reported here. In various runs, the failure has occurred anywhere from 43 time steps into the simulation to 8+ months in. The timing of the failure is not reproducible. I kept a log of my troubleshooting efforts here: https://acme-climate.atlassian.net/wiki/spaces/EBGC/pages/1310655882/Atmosphere+Only+Simulations+in+support+of+the+v2+BGC+Campaign

I tried applying the fix from https://github.com/E3SM-Project/E3SM/pull/3324, which appeared potentially related (and which is not currently on maint-1.0), but this did not resolve the issue. It was also not resolved by using version 18.0.1.163 of the Intel compiler (which @ndkeen references above), nor by switching to other versions of the Intel compiler currently available on cori.

So: If I have tested everything correctly, I think this issue persists on maint-1.0, at least in the F20TRC5-CMIP6 low-res case.

susburrows commented 4 years ago

@kvcalvin @rljacob : just adding notifications to you as an FYI. At this point, I am not intending to do any further troubleshooting on this issue unless we decide otherwise.