Open ekluzek opened 2 years ago
Updating to ccs_config_cesm0.0.65
via #2000 now results in all the nvhpc
tests on cheyenne failing at run time. It is expected that updating to cesm2_3_beta15
will resolve this.
In the CESM3_dev branch two of the tests now PASS:
SMS.f10_f10_mg37.I2000Clm50BgcCrop.cheyenne_nvhpc.clm-crop FAILED PREVIOUSLY
SMS.f45_f45_mg37.I2000Clm51FatesSpRsGs.cheyenne_nvhpc.clm-FatesColdSatPhen FAILED PREVIOUSLY
This one still fails, but now with a floating point exception:
SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop EXPECTED
The cesm.log file shows that there is a problem in ESMF at initialization, in creating an ESMF mesh. ESMF doesn't drop PET log files by default in this case...
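Since PET logs would show where ESMF hit the error, one way to get them is to turn on ESMF logging in the case before resubmitting. This is a sketch only, assuming the standard CIME ./xmlchange interface; the ESMF_LOGFILE_KIND variable name and its ESMF_LOGKIND_MULTI value are assumptions about the CIME version in use, not confirmed from this setup.

```shell
# Sketch: enable per-PET ESMF log files for the failing case, then resubmit.
# ESMF_LOGFILE_KIND / ESMF_LOGKIND_MULTI are assumptions about this CIME
# version; check ./xmlquery --listall in the case for the actual name.
cd $CASEROOT
./xmlchange ESMF_LOGFILE_KIND=ESMF_LOGKIND_MULTI
./case.submit
```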
cesm.log:
[1,0]<stderr>: (t_initf) profile_papi_enable= F
[1,0]<stdout>: /glade/work/erik/ctsm_worktrees/cesm3_dev/share/src/shr_file_mod.F90
[1,0]<stdout>: 912 This routine is depricated - use shr_log_setLogUnit instead
[1,0]<stdout>: -18
[1,0]<stdout>: /glade/work/erik/ctsm_worktrees/cesm3_dev/share/src/shr_file_mod.F90
[1,0]<stdout>: 912 This routine is depricated - use shr_log_setLogUnit instead
[1,0]<stdout>: -25
[1,0]<stderr>:[r3i7n18:45933:0:45933] Caught signal 8 (Floating point exception: floating-point invalid operation)
[1,36]<stderr>:[r3i7n33:33507:0:33507] Caught signal 8 (Floating point exception: floating-point invalid operation)
[1,0]<stderr>:==== backtrace (tid: 45933) ====
[1,0]<stderr>: 0 /glade/u/apps/ch/opt/ucx/1.12.1/lib/libucs.so.0(ucs_handle_error+0x134) [0x2ae710b0fd74]
[1,0]<stderr>: 1 /glade/u/apps/ch/opt/ucx/1.12.1/lib/libucs.so.0(+0x2e0dc) [0x2ae710b100dc]
[1,0]<stderr>: 2 /glade/u/apps/ch/opt/ucx/1.12.1/lib/libucs.so.0(+0x2e463) [0x2ae710b10463]
[1,0]<stderr>: 3 /glade/u/apps/ch/opt/openmpi/4.1.4/nvhpc/22.2/lib/libmca_common_ompio.so.41(mca_common_ompio_simple_grouping+0xe4) [0x2ae71fa93a64]
[1,0]<stderr>: 4 /glade/u/apps/ch/opt/openmpi/4.1.4/nvhpc/22.2/lib/libmca_common_ompio.so.41(mca_common_ompio_set_view+0x937) [0x2ae71fa9c877]
[1,0]<stderr>: 5 /glade/u/apps/ch/opt/openmpi/4.1.4/nvhpc/22.2/lib/openmpi/mca_io_ompio.so(mca_io_ompio_file_set_view+0xc7) [0x2ae720cf2347]
[1,0]<stderr>: 6 /glade/u/apps/ch/opt/openmpi/4.1.4/nvhpc/22.2/lib/libmpi.so.40(PMPI_File_set_view+0x1a4) [0x2ae6f30a68e4]
[1,0]<stderr>: 7 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ncmpio_file_set_view+0x161) [0x2ae6f034d4a1]
[1,0]<stderr>: 8 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(+0x32e28e2) [0x2ae6f032b8e2]
[1,0]<stderr>: 9 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(+0x32e1469) [0x2ae6f032a469]
[1,0]<stderr>:10 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(+0x32e02c6) [0x2ae6f03292c6]
[1,0]<stderr>:11 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(+0x32df9d2) [0x2ae6f03289d2]
[1,0]<stderr>:12 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ncmpio_wait+0x9f) [0x2ae6f032855f]
[1,0]<stderr>:13 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ncmpio_get_varn+0x9f) [0x2ae6f032781f]
[1,0]<stderr>:14 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ncmpi_get_varn_all+0x2d7) [0x2ae6f02be097]
[1,0]<stderr>:15 /glade/scratch/erik/tests_ctsm51d145cesm3n3acl/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.GC.ctsm51d145cesm3n3acl_nvh/bld/cesm.exe() [0x1a758be]
[1,0]<stderr>:16 /glade/scratch/erik/tests_ctsm51d145cesm3n3acl/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.GC.ctsm51d145cesm3n3acl_nvh/bld/cesm.exe(PIOc_read_darray+0x413) [0x1a72c53]
[1,0]<stderr>:17 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z37get_numElementConn_from_ESMFMesh_fileiiPcxiPxRPi+0x48e) [0x2ae6ee1c7d8e]
[1,0]<stderr>:18 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z42get_elemConn_info_2Dvar_from_ESMFMesh_fileiiPcxiPiRiRS0_S2_+0x99) [0x2ae6ee1c9c19]
[1,0]<stderr>:19 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z36get_elemConn_info_from_ESMFMesh_fileiiPcxiPiRiRS0_S2_+0x28c) [0x2ae6ee1caa4c]
[1,0]<stderr>:20 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z36ESMCI_mesh_create_from_ESMFMesh_fileiPcb18ESMC_CoordSys_FlagPN5ESMCI8DistGridEPPNS1_4MeshE+0x63a) [0x2ae6ee6bc87a]
[1,0]<stderr>:21 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z27ESMCI_mesh_create_from_filePc20ESMC_FileFormat_Flagbb18ESMC_CoordSys_Flag17ESMC_MeshLoc_FlagS_PN5ESMCI8DistGridES5_PPNS3_4MeshEPi+0x2eb) [0x2ae6ee6bb8eb]
[1,0]<stderr>:22 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI7MeshCap21meshcreatefromfilenewEPc20ESMC_FileFormat_Flagbb18ESMC_CoordSys_Flag17ESMC_MeshLoc_FlagS1_PNS_8DistGridES6_Pi+0x99) [0x2ae6ee675919]
[1,0]<stderr>:23 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_meshcreatefromfile_+0x1a7) [0x2ae6ee6c51a7]
[1,0]<stderr>:24 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_meshmod_esmf_meshcreatefromfile_+0x217) [0x2ae6ef401fd7]
[1,0]<stderr>:25 /glade/scratch/erik/tests_ctsm51d145cesm3n3acl/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.GC.ctsm51d145cesm3n3acl_nvh/bld/cesm.exe() [0x17668d1]
[1,0]<stderr>:26 /glade/scratch/erik/tests_ctsm51d145cesm3n3acl/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.GC.ctsm51d145cesm3n3acl_nvh/bld/cesm.exe() [0x61af01]
[1,0]<stderr>:27 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI6FTable12callVFuncPtrEPKcPNS_2VMEPi+0xc3c) [0x2ae6ee25633c]
[1,0]<stderr>:28 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ESMCI_FTableCallEntryPointVMHop+0x293) [0x2ae6ee251953]
[1,0]<stderr>:29 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI3VMK5enterEPNS_7VMKPlanEPvS3_+0xbb) [0x2ae6eeaf82fb]
[1,0]<stderr>:30 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI2VM5enterEPNS_6VMPlanEPvS3_+0xbe) [0x2ae6eeb2237e]
[1,0]<stderr>:31 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_ftablecallentrypointvm_+0x393) [0x2ae6ee251e13]
[1,0]<stderr>:32 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_compmod_esmf_compexecute_+0xab0) [0x2ae6eee59870]
[1,0]<stderr>:33 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_gridcompmod_esmf_gridcompinitialize_+0x1de) [0x2ae6ef13f35e]
[1,0]<stderr>:34 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(nuopc_driver_loopmodelcompss_+0x1036) [0x2ae6ef8ad876]
[1,0]<stderr>:35 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(nuopc_driver_initializeipdv02p3_+0x2208) [0x2ae6ef89fcc8]
[1,0]<stderr>:36 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI6FTable12callVFuncPtrEPKcPNS_2VMEPi+0xc3c) [0x2ae6ee25633c]
[1,0]<stderr>:37 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ESMCI_FTableCallEntryPointVMHop+0x293) [0x2ae6ee251953]
[1,0]<stderr>:38 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI3VMK5enterEPNS_7VMKPlanEPvS3_+0xbb) [0x2ae6eeaf82fb]
[1,0]<stderr>:39 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI2VM5enterEPNS_6VMPlanEPvS3_+0xbe) [0x2ae6eeb2237e]
[1,0]<stderr>:40 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_ftablecallentrypointvm_+0x393) [0x2ae6ee251e13]
[1,0]<stderr>:41 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_compmod_esmf_compexecute_+0xab0) [0x2ae6eee59870]
[1,0]<stderr>:42 /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_gridcompmod_esmf_gridcompinitialize_+0x1de) [0x2ae6ef13f35e]
Seeing similar errors on Derecho:
These PASS:
SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop
SMS.f45_f45_mg37.I2000Clm51FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen
These FAIL:
ERP_D_P128x2_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default
ERS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default
SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.derecho_nvhpc.clm-crop
SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50SpRs.derecho_nvhpc.clm-ptsRLA
SMS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default
The failures are now all in the build, with an error message from the FATES code like this:
Lowering Error: symbol hlm_pft_map$sd is an inconsistent array descriptor
NVFORTRAN-F-0000-Internal compiler error. Errors in Lowering 1 (/glade/work/erik/ctsm_worktrees/external_updates/src/fates/main/EDPftvarcon.F90: 2191)
NVFORTRAN/x86-64 Linux 23.5-0: compilation aborted
gmake: *** [/glade/derecho/scratch/erik/tests_ctsm51d155derechoacl/SMS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default.GC.ctsm51d155derechoacl_nvh/Tools/Makefile:978: EDPftvarcon.o] Error 2
gmake: *** Waiting for unfinished jobs....
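One hedged way to iterate on an internal compiler error like this is to re-run just the failing compile from the test's build directory, rather than a full ./case.build. This is a sketch based on the paths in the log above; whether the gmake invocation works standalone depends on variables the CIME Makefile expects, so treat it as illustrative rather than a confirmed recipe.

```shell
# Sketch, using paths from the log above. EDPftvarcon.F90 has module
# dependencies, so compile it through the case Makefile rather than
# invoking nvfortran on the file directly.
cd /glade/derecho/scratch/erik/tests_ctsm51d155derechoacl/SMS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default.GC.ctsm51d155derechoacl_nvh
# Re-run only the failing rule to reproduce the ICE quickly:
gmake -f Tools/Makefile EDPftvarcon.o
```

From there one could vary optimization or debug flags on that single file to see whether the ICE is flag-dependent, which is useful detail for an NVIDIA bug report.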
Looking at the code I don't see an obvious problem. I googled it and there are some NVIDIA nvhpc reports about these kinds of errors, but it's not obvious what the issue is here or how to fix it.
A reminder that nvhpc is important for the flexibility to be able to start using GPUs, and since Derecho has NVIDIA GPUs, nvhpc is going to be the most performant compiler on Derecho for its GPUs.
Even though GPUs don't currently look important for most uses of CTSM, they will be important for ultra-high resolution. And as hardware changes in the future, it's important to have flexibility in the model to take advantage of different types of hardware in order to keep the model working well.
Corrected that Derecho has NVIDIA GPUs. From talking with @sherimickelson, and from slides presented by her group at the Sep 12, 2023 CSEG meeting, the nvhpc and cray compilers work for the Derecho GPUs, but intel-oneapi didn't at the time.
We talked about this in the CSEG meeting. The takeaways are:
Jim: feels that we do want to test with nvhpc, so that we know if things start failing; if we need to write a bug report, we can do that and then move on.
Brian: agrees that testing with it is good, but supporting nvhpc shouldn't be a requirement for CESM3.
This is great news and thanks, @ekluzek for sharing this and for your support.
Brief summary of bug
MPI tests with DEBUG on are failing at run time with the nvhpc compiler on cheyenne. This continues in ctsm5.1.dev155-38-g5c8f17b1a (the derecho1 branch) on derecho.
General bug information
CTSM version you are using: ctsm5.1.dev082 in cesm2_3_alpha08d
Does this bug cause significantly incorrect results in the model's science? No
Configurations affected: tests with nvhpc and DEBUG on
Details of bug
These tests fail:
SMS_D.f19_g17.IHistClm50Bgc.cheyenne_nvhpc.clm-decStart
SMS_D.f45_f45_mg37.I2000Clm50FatesRs.cheyenne_nvhpc.clm-FatesColdDef
SMS_D_Ld1.f10_f10_mg37.I1850Clm50Sp.cheyenne_nvhpc.clm-default
SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
While the DEBUG-off tests PASS:
SMS.f19_g17.IHistClm50Bgc.cheyenne_nvhpc.clm-decStart
SMS_Ld1.f10_f10_mg37.I1850Clm50Sp.cheyenne_nvhpc.clm-default
As do these mpi-serial tests:
SMS_D_Ld1_Mmpi-serial.1x1_brazil.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
SMS_D_Ld1_Mmpi-serial.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
SMS_D_Mmpi-serial.1x1_brazil.I2000Clm50FatesRs.cheyenne_nvhpc.clm-FatesColdDef
SMS_D_Mmpi-serial.1x1_brazil.IHistClm50BgcQianRs.cheyenne_nvhpc.clm-default
SMS_Mmpi-serial.1x1_brazil.IHistClm50BgcQianRs.cheyenne_nvhpc.clm-default
Important details of your setup / configuration so we can reproduce the bug
Important output or errors that show the problem
For the smallest case: SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
The only log file available is the cesm.log file, as follows.
cesm.log file: