ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

`nvhpc` compiler tests are failing on cheyenne/derecho #1733

Open ekluzek opened 2 years ago

ekluzek commented 2 years ago

Brief summary of bug

MPI tests with DEBUG on are failing at runtime with the nvhpc compiler on cheyenne. This continues as of ctsm5.1.dev155-38-g5c8f17b1a (the derecho1 branch) on Derecho.

General bug information

CTSM version you are using: ctsm5.1.dev082 in cesm2_3_alpha08d

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: tests with nvhpc and DEBUG on
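For context, a DEBUG build with nvhpc typically adds runtime-checking and trapping flags roughly along these lines (illustrative only; the authoritative list lives in the CIME/ccs_config compiler macros for nvhpc), which is the usual reason a problem shows up only in the _D tests:

 -g -O0 -Mbounds -Kieee -Ktrap=fp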

Details of bug

These tests fail:

SMS_D.f19_g17.IHistClm50Bgc.cheyenne_nvhpc.clm-decStart
SMS_D.f45_f45_mg37.I2000Clm50FatesRs.cheyenne_nvhpc.clm-FatesColdDef
SMS_D_Ld1.f10_f10_mg37.I1850Clm50Sp.cheyenne_nvhpc.clm-default
SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default

While these DEBUG-off tests PASS:

SMS.f19_g17.IHistClm50Bgc.cheyenne_nvhpc.clm-decStart
SMS_Ld1.f10_f10_mg37.I1850Clm50Sp.cheyenne_nvhpc.clm-default

As do these mpi-serial tests:

SMS_D_Ld1_Mmpi-serial.1x1_brazil.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
SMS_D_Ld1_Mmpi-serial.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
SMS_D_Mmpi-serial.1x1_brazil.I2000Clm50FatesRs.cheyenne_nvhpc.clm-FatesColdDef
SMS_D_Mmpi-serial.1x1_brazil.IHistClm50BgcQianRs.cheyenne_nvhpc.clm-default
SMS_Mmpi-serial.1x1_brazil.IHistClm50BgcQianRs.cheyenne_nvhpc.clm-default
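
For reference, any one of these tests can be rerun directly with CIME's create_test (a sketch, assuming a standard CTSM checkout with its externals populated), e.g.:

 ./cime/scripts/create_test SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default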

Important details of your setup / configuration so we can reproduce the bug

Important output or errors that show the problem

For the smallest case, SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default, the only log file available is the cesm.log file.

cesm.log file:

 (t_initf)       profile_single_file=       F
 (t_initf)       profile_global_stats=      T
 (t_initf)       profile_ovhd_measurement=  F
 (t_initf)       profile_add_detail=        F
 (t_initf)       profile_papi_enable=       F
[r12i4n4:35002:0:35002] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35003:0:35003] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35004:0:35004] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35006:0:35006] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35007:0:35007] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35008:0:35008] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35010:0:35010] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35011:0:35011] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35012:0:35012] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35013:0:35013] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35014:0:35014] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35015:0:35015] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35017:0:35017] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35018:0:35018] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35019:0:35019] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35020:0:35020] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35022:0:35022] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35000:0:35000] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35001:0:35001] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35016:0:35016] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 21 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
==== backtrace (tid:  35022) ====
 0  /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(ucs_handle_error+0xe4) [0x2ba9d97301a4]
 1  /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(+0x2a4cc) [0x2ba9d97304cc]
 2  /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(+0x2a73b) [0x2ba9d973073b]
 3  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI6LogErr13MsgFoundErrorEiPKciS2_S2_Pi+0x34) [0x2ba9b78f4c74]
 4  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI7MeshCap22meshcreatenodedistgridEPi+0x7f) [0x2ba9b7b15ebf]
 5  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_meshcreatenodedistgrid_+0xc1) [0x2ba9b7b61141]
 6  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_meshmod_esmf_meshaddelements_+0xbc0) [0x2ba9b881c880]
 7  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_meshmod_esmf_meshcreatefromunstruct_+0x4d0f) [0x2ba9b88246cf]
 8  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_meshmod_esmf_meshcreatefromfile_+0x270) [0x2ba9b881f270]
 9  /glade/scratch/erik/SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default.GC.cesm2_3_alpha8achlist/bld/cesm.exe() [0x15d8fd0]
10  /glade/scratch/erik/SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default.GC.cesm2_3_alpha8achlist/bld/cesm.exe() [0x632341]
11  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI6FTable12callVFuncPtrEPKcPNS_2VMEPi+0xc30) [0x2ba9b77436b0]
12  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ESMCI_FTableCallEntryPointVMHop+0x293) [0x2ba9b773e913]
13  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI3VMK5enterEPNS_7VMKPlanEPvS3_+0xbb) [0x2ba9b7f7b9fb]
14  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI2VM5enterEPNS_6VMPlanEPvS3_+0xbe) [0x2ba9b7fa3bbe]
15  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_ftablecallentrypointvm_+0x393) [0x2ba9b773edd3]
16  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_compmod_esmf_compexecute_+0xa26) [0x2ba9b82d2c66]
17  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_gridcompmod_esmf_gridcompinitialize_+0x1de) [0x2ba9b85a5ede]
glemieux commented 1 year ago

Updating to ccs_config_cesm0.0.65 via #2000 now results in all the nvhpc tests on cheyenne failing at run time. It is expected that updating to cesm2_3_beta15 will resolve this.

ekluzek commented 11 months ago

In the CESM3_dev branch two of the tests now PASS:

SMS.f10_f10_mg37.I2000Clm50BgcCrop.cheyenne_nvhpc.clm-crop (FAILED PREVIOUSLY)
SMS.f45_f45_mg37.I2000Clm51FatesSpRsGs.cheyenne_nvhpc.clm-FatesColdSatPhen (FAILED PREVIOUSLY)

This one still fails, but now with a floating point exception:

SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop EXPECTED

The cesm.log file shows that there is a problem in ESMF at initialization, while creating an ESMF mesh. PET log files aren't dropped by default in this case...
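
The frames in the backtrace below sit inside ESMF's mesh-create-from-file path. As a hedged sketch only (this is not the actual CMEPS/CTSM source), the failing operation at the Fortran level is a call of this general form, where meshfile is a stand-in for the mesh path the driver reads from its configuration:

! Sketch using the standard ESMF Fortran API; 'meshfile' is a stand-in variable.
subroutine create_mesh_sketch(meshfile)
  use ESMF
  implicit none
  character(len=*), intent(in) :: meshfile
  type(ESMF_Mesh) :: mesh
  integer :: rc
  ! Read the unstructured mesh description from file and build the ESMF mesh.
  mesh = ESMF_MeshCreate(filename=trim(meshfile), &
       fileformat=ESMF_FILEFORMAT_ESMFMESH, rc=rc)
  if (ESMF_LogFoundError(rcToCheck=rc, msg=ESMF_LOGERR_PASSTHRU, &
       line=__LINE__, file=__FILE__)) call ESMF_Finalize(endflag=ESMF_END_ABORT)
end subroutine create_mesh_sketch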

cesm.log:

[1,0]<stderr>: (t_initf)       profile_papi_enable=       F
[1,0]<stdout>: /glade/work/erik/ctsm_worktrees/cesm3_dev/share/src/shr_file_mod.F90
[1,0]<stdout>:          912 This routine is depricated - use shr_log_setLogUnit instead
[1,0]<stdout>:          -18
[1,0]<stdout>: /glade/work/erik/ctsm_worktrees/cesm3_dev/share/src/shr_file_mod.F90
[1,0]<stdout>:          912 This routine is depricated - use shr_log_setLogUnit instead
[1,0]<stdout>:          -25
[1,0]<stderr>:[r3i7n18:45933:0:45933] Caught signal 8 (Floating point exception: floating-point invalid operation)
[1,36]<stderr>:[r3i7n33:33507:0:33507] Caught signal 8 (Floating point exception: floating-point invalid operation)
[1,0]<stderr>:==== backtrace (tid:  45933) ====
[1,0]<stderr>: 0  /glade/u/apps/ch/opt/ucx/1.12.1/lib/libucs.so.0(ucs_handle_error+0x134) [0x2ae710b0fd74]
[1,0]<stderr>: 1  /glade/u/apps/ch/opt/ucx/1.12.1/lib/libucs.so.0(+0x2e0dc) [0x2ae710b100dc]
[1,0]<stderr>: 2  /glade/u/apps/ch/opt/ucx/1.12.1/lib/libucs.so.0(+0x2e463) [0x2ae710b10463]
[1,0]<stderr>: 3  /glade/u/apps/ch/opt/openmpi/4.1.4/nvhpc/22.2/lib/libmca_common_ompio.so.41(mca_common_ompio_simple_grouping+0xe4) [0x2ae71fa93a64]
[1,0]<stderr>: 4  /glade/u/apps/ch/opt/openmpi/4.1.4/nvhpc/22.2/lib/libmca_common_ompio.so.41(mca_common_ompio_set_view+0x937) [0x2ae71fa9c877]
[1,0]<stderr>: 5  /glade/u/apps/ch/opt/openmpi/4.1.4/nvhpc/22.2/lib/openmpi/mca_io_ompio.so(mca_io_ompio_file_set_view+0xc7) [0x2ae720cf2347]
[1,0]<stderr>: 6  /glade/u/apps/ch/opt/openmpi/4.1.4/nvhpc/22.2/lib/libmpi.so.40(PMPI_File_set_view+0x1a4) [0x2ae6f30a68e4]
[1,0]<stderr>: 7  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ncmpio_file_set_view+0x161) [0x2ae6f034d4a1]
[1,0]<stderr>: 8  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(+0x32e28e2) [0x2ae6f032b8e2]
[1,0]<stderr>: 9  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(+0x32e1469) [0x2ae6f032a469]
[1,0]<stderr>:10  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(+0x32e02c6) [0x2ae6f03292c6]
[1,0]<stderr>:11  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(+0x32df9d2) [0x2ae6f03289d2]
[1,0]<stderr>:12  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ncmpio_wait+0x9f) [0x2ae6f032855f]
[1,0]<stderr>:13  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ncmpio_get_varn+0x9f) [0x2ae6f032781f]
[1,0]<stderr>:14  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ncmpi_get_varn_all+0x2d7) [0x2ae6f02be097]
[1,0]<stderr>:15  /glade/scratch/erik/tests_ctsm51d145cesm3n3acl/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.GC.ctsm51d145cesm3n3acl_nvh/bld/cesm.exe() [0x1a758be]
[1,0]<stderr>:16  /glade/scratch/erik/tests_ctsm51d145cesm3n3acl/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.GC.ctsm51d145cesm3n3acl_nvh/bld/cesm.exe(PIOc_read_darray+0x413) [0x1a72c53]
[1,0]<stderr>:17  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z37get_numElementConn_from_ESMFMesh_fileiiPcxiPxRPi+0x48e) [0x2ae6ee1c7d8e]
[1,0]<stderr>:18  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z42get_elemConn_info_2Dvar_from_ESMFMesh_fileiiPcxiPiRiRS0_S2_+0x99) [0x2ae6ee1c9c19]
[1,0]<stderr>:19  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z36get_elemConn_info_from_ESMFMesh_fileiiPcxiPiRiRS0_S2_+0x28c) [0x2ae6ee1caa4c]
[1,0]<stderr>:20  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z36ESMCI_mesh_create_from_ESMFMesh_fileiPcb18ESMC_CoordSys_FlagPN5ESMCI8DistGridEPPNS1_4MeshE+0x63a) [0x2ae6ee6bc87a]
[1,0]<stderr>:21  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z27ESMCI_mesh_create_from_filePc20ESMC_FileFormat_Flagbb18ESMC_CoordSys_Flag17ESMC_MeshLoc_FlagS_PN5ESMCI8DistGridES5_PPNS3_4MeshEPi+0x2eb) [0x2ae6ee6bb8eb]
[1,0]<stderr>:22  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI7MeshCap21meshcreatefromfilenewEPc20ESMC_FileFormat_Flagbb18ESMC_CoordSys_Flag17ESMC_MeshLoc_FlagS1_PNS_8DistGridES6_Pi+0x99) [0x2ae6ee675919]
[1,0]<stderr>:23  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_meshcreatefromfile_+0x1a7) [0x2ae6ee6c51a7]
[1,0]<stderr>:24  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_meshmod_esmf_meshcreatefromfile_+0x217) [0x2ae6ef401fd7]
[1,0]<stderr>:25  /glade/scratch/erik/tests_ctsm51d145cesm3n3acl/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.GC.ctsm51d145cesm3n3acl_nvh/bld/cesm.exe() [0x17668d1]
[1,0]<stderr>:26  /glade/scratch/erik/tests_ctsm51d145cesm3n3acl/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.GC.ctsm51d145cesm3n3acl_nvh/bld/cesm.exe() [0x61af01]
[1,0]<stderr>:27  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI6FTable12callVFuncPtrEPKcPNS_2VMEPi+0xc3c) [0x2ae6ee25633c]
[1,0]<stderr>:28  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ESMCI_FTableCallEntryPointVMHop+0x293) [0x2ae6ee251953]
[1,0]<stderr>:29  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI3VMK5enterEPNS_7VMKPlanEPvS3_+0xbb) [0x2ae6eeaf82fb]
[1,0]<stderr>:30  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI2VM5enterEPNS_6VMPlanEPvS3_+0xbe) [0x2ae6eeb2237e]
[1,0]<stderr>:31  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_ftablecallentrypointvm_+0x393) [0x2ae6ee251e13]
[1,0]<stderr>:32  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_compmod_esmf_compexecute_+0xab0) [0x2ae6eee59870]
[1,0]<stderr>:33  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_gridcompmod_esmf_gridcompinitialize_+0x1de) [0x2ae6ef13f35e]
[1,0]<stderr>:34  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(nuopc_driver_loopmodelcompss_+0x1036) [0x2ae6ef8ad876]
[1,0]<stderr>:35  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(nuopc_driver_initializeipdv02p3_+0x2208) [0x2ae6ef89fcc8]
[1,0]<stderr>:36  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI6FTable12callVFuncPtrEPKcPNS_2VMEPi+0xc3c) [0x2ae6ee25633c]
[1,0]<stderr>:37  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ESMCI_FTableCallEntryPointVMHop+0x293) [0x2ae6ee251953]
[1,0]<stderr>:38  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI3VMK5enterEPNS_7VMKPlanEPvS3_+0xbb) [0x2ae6eeaf82fb]
[1,0]<stderr>:39  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI2VM5enterEPNS_6VMPlanEPvS3_+0xbe) [0x2ae6eeb2237e]
[1,0]<stderr>:40  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_ftablecallentrypointvm_+0x393) [0x2ae6ee251e13]
[1,0]<stderr>:41  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_compmod_esmf_compexecute_+0xab0) [0x2ae6eee59870]
[1,0]<stderr>:42  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_gridcompmod_esmf_gridcompinitialize_+0x1de) [0x2ae6ef13f35e]
ekluzek commented 10 months ago

Seeing similar errors on Derecho:

These PASS:

SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop
SMS.f45_f45_mg37.I2000Clm51FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen

These FAIL:

ERP_D_P128x2_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default
ERS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default
SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.derecho_nvhpc.clm-crop
SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50SpRs.derecho_nvhpc.clm-ptsRLA
SMS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default

The failures are now all in the build, with an error message from the FATES code like this:

Lowering Error: symbol hlm_pft_map$sd is an inconsistent array descriptor
NVFORTRAN-F-0000-Internal compiler error. Errors in Lowering       1  (/glade/work/erik/ctsm_worktrees/external_updates/src/fates/main/EDPftvarcon.F90: 2191)
NVFORTRAN/x86-64 Linux 23.5-0: compilation aborted
gmake: *** [/glade/derecho/scratch/erik/tests_ctsm51d155derechoacl/SMS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default.GC.ctsm51d155derechoacl_nvh/Tools/Makefile:978: EDPftvarcon.o] Error 2
gmake: *** Waiting for unfinished jobs....

Looking at the code I don't see an obvious problem. Searching around, there are some NVIDIA nvhpc reports about these kinds of errors, but it's not obvious what the issue is here or how to fix it.
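
If we end up filing a compiler bug report, NVIDIA will generally want a standalone reducer isolated around the hlm_pft_map handling. A hypothetical skeleton of what such a reducer might look like is below; the array and the sourced allocation are stand-ins for whatever construct in EDPftvarcon.F90 actually trips the lowering pass, not the real FATES code:

! Hypothetical reducer skeleton (NOT the actual EDPftvarcon.F90 code).
module pft_map_reducer
  implicit none
  real, allocatable :: hlm_pft_map(:,:)   ! stand-in for the FATES mapping array
contains
  subroutine init_map(n_hlm, n_fates)
    integer, intent(in) :: n_hlm, n_fates
    real, allocatable :: tmp(:,:)
    allocate(tmp(n_hlm, n_fates))
    tmp = 0.0
    ! Sourced allocation from another allocatable is one example of the kind of
    ! construct worth isolating first when trying to reproduce the internal
    ! compiler error outside of FATES.
    allocate(hlm_pft_map, source=tmp)
  end subroutine init_map
end module pft_map_reducer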

ekluzek commented 5 months ago

A reminder that nvhpc is important for the flexibility to start using GPUs, and since Derecho has NVIDIA GPUs, nvhpc is likely to be the most performant compiler for Derecho's GPUs.

Even though GPUs don't currently look important for most uses of CTSM, they will be important for ultra-high-resolution work. And as hardware changes in the future, it's important to have the flexibility in the model to take advantage of different types of hardware and keep the model working well.

ekluzek commented 5 months ago

Corrected above that Derecho has NVIDIA GPUs. From talking with @sherimickelson, and from slides presented by her group at the Sep 12, 2023 CSEG meeting, the nvhpc and cray compilers work for the Derecho GPUs, but intel-oneapi didn't at the time.

ekluzek commented 5 months ago

We talked about this in the CSEG meeting. The takeaways are:

Jim feels that we do want to test with NVHPC, so that we know if things start failing. If we need to write a bug report, we can do that, and then move on.

Brian agrees that testing with it is good, but supporting nvhpc shouldn't be a requirement for CESM3.

sherimickelson commented 5 months ago

This is great news and thanks, @ekluzek for sharing this and for your support.