Closed by ndkeen 6 years ago
I can repeat the failure with only 85 nodes using SMS_D_P5440.ne120_ne120.FC5AV1C-H01A
Looking at the code, it seems fine.
The failure occurs at the final deallocate of the rstate object itself, after all of its members have already been deallocated. I will try a few things.
```fortran
subroutine rrtmg_state_destroy(rstate)
   implicit none
   type(rrtmg_state_t), pointer :: rstate

   deallocate(rstate%h2ovmr)
   deallocate(rstate%o3vmr)
   deallocate(rstate%co2vmr)
   deallocate(rstate%ch4vmr)
   deallocate(rstate%o2vmr)
   deallocate(rstate%n2ovmr)
   deallocate(rstate%cfc11vmr)
   deallocate(rstate%cfc12vmr)
   deallocate(rstate%cfc22vmr)
   deallocate(rstate%ccl4vmr)
   deallocate(rstate%pmidmb)
   deallocate(rstate%pintmb)
   deallocate(rstate%tlay)
   deallocate(rstate%tlev)
   deallocate(rstate) !<-- this line
   nullify(rstate)
end subroutine rrtmg_state_destroy
```
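As a debugging aid (not part of the original code), a defensive variant of the destroy routine can help localize which deallocate is actually failing. This is a sketch using the standard `stat=` specifier, which turns a failing deallocate into a diagnosable return code instead of a run-time abort; the type name `rrtmg_state_t` and member names are taken from the routine above, and the routine name `rrtmg_state_destroy_safe` is hypothetical.

```fortran
! Hypothetical defensive variant, sketched for debugging only.
subroutine rrtmg_state_destroy_safe(rstate)
   implicit none
   type(rrtmg_state_t), pointer :: rstate
   integer :: ierr

   ! Guard against a disassociated pointer before touching members.
   if (.not. associated(rstate)) return

   ! stat= reports failure without aborting, so the failing member
   ! can be identified from the log; repeat for each member.
   deallocate(rstate%h2ovmr, stat=ierr)
   if (ierr /= 0) write(*,*) 'deallocate(h2ovmr) failed, stat=', ierr
   ! ... same pattern for the remaining members ...
   deallocate(rstate%tlev, stat=ierr)
   if (ierr /= 0) write(*,*) 'deallocate(tlev) failed, stat=', ierr

   deallocate(rstate, stat=ierr)
   if (ierr /= 0) write(*,*) 'deallocate(rstate) failed, stat=', ierr
   nullify(rstate)
end subroutine rrtmg_state_destroy_safe
```

This would not fix the underlying problem, but it separates a genuine heap corruption (nonzero `stat` or a crash before the print) from a compiler/runtime issue in the final deallocate.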
OK, the above was using the "default" version 18 on the machine (intel/18.0.0.128), but there is one version higher: intel/18.0.1.163. Trying that, I don't see any issues. All other tests pass as well, so I will close this issue.
I have been trying to run maint-1.0 with an F20TRC5-CMIP6 low-res case on cori-knl and have been stymied by intermittent failures that have symptoms identical to those reported here. In various runs, the failure has occurred anywhere from 43 time steps into the simulation to 8+ months in. The timing of the failure is not reproducible. I kept a log of my troubleshooting efforts here: https://acme-climate.atlassian.net/wiki/spaces/EBGC/pages/1310655882/Atmosphere+Only+Simulations+in+support+of+the+v2+BGC+Campaign
I tried applying the fix from https://github.com/E3SM-Project/E3SM/pull/3324 which appeared potentially related (and which is not on maint-1.0 currently), but this did not resolve the issue. It was also not resolved by using version 18.0.1.163 of the intel compiler (which @ndkeen references above), nor by switching to other versions of the intel compiler currently available on cori.
So: If I have tested everything correctly, I think this issue persists on maint-1.0, at least in the F20TRC5-CMIP6 low-res case.
@kvcalvin @rljacob : just adding notifications to you as an FYI. At this point, I am not intending to do any further troubleshooting on this issue unless we decide otherwise.
I have a branch
ndk/machinefiles/cori-default-intel18
where I want to change the default compiler on cori-knl from v17 to v18. All other tests I've tried pass, but I did hit a problem with the high-res F case. The following runs use the (current) default of 675 nodes. I ran with DEBUG=TRUE and hit an error in this case:
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/mfdflt18/SMS_D.ne120_ne120.FC5AV1C-H01A.cori-knl_intel.20180104_111658_xe7sd9
The non-DEBUG attempt failed differently:
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/mfdflt18/SMS.ne120_ne120.FC5AV1C-H01A.cori-knl_intel.20180103_171123_ftjxxo
It's possible that Intel 18 (along with minor flag changes) is the issue, but I think it's more likely the compiler has found something others missed.
I realize this isn't in master yet, but wanted to record info and hopefully someone might know more. I've already submitted other runs using fewer nodes (and one without threads).
OK, apparently running the same (non-DEBUG) test again passed:
SMS.ne120_ne120.FC5AV1C-H01A.cori-knl_intel
The error above in m_AttrVect.F90 was inside an OpenMP loop, so it's still possible there's an issue. Passing test: SMS.ne120_ne120.FC5AV1C-H01A.cori-knl_intel.20180104_110519_m4ibyt
And I ran with 1 thread without issue:
SMS_PMx1.ne120_ne120.FC5AV1C-H01A.cori-knl_intel
A 338-node F-case also ran OK:
SMS.ne120_ne120.mfdflt18.n0338p21632t64x2
So perhaps we can ignore the non-DEBUG error, and I will try to recreate the DEBUG=TRUE issue with the smallest possible test. I haven't looked at the code for that one yet.