It looks like the fix is that nvhpc needs to use pio/2.5.5 in its module loads rather than pio/2.5.6.
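For context, a minimal sketch of what the corrected module loads might look like on Cheyenne. The exact module set and ordering in ccs_config may differ; the versions here are inferred from the library paths in the backtrace below:

```sh
module load nvhpc/21.11
module load openmpi/4.1.1
# load the PIO build that links successfully, rather than pio/2.5.6
module load pio/2.5.5
```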
Using pio/2.5.5 gets it to build, but it fails at runtime in PIO as follows. So I'll ask CISL to install pio/2.5.6 for nvhpc on Cheyenne.
```
[r4i0n31:66450:0:66450] Caught signal 8 (Floating point exception: floating-point invalid operation)
[r4i0n32:43223:0:43223] Caught signal 8 (Floating point exception: floating-point invalid operation)
==== backtrace (tid: 66450) ====
 0 /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(ucs_handle_error+0xe4) [0x2ad63025f1a4]
 1 /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(+0x2a4cc) [0x2ad63025f4cc]
 2 /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(+0x2a853) [0x2ad63025f853]
 3 /glade/u/apps/ch/opt/openmpi/4.1.1/nvhpc/21.11/lib/libmca_common_ompio.so.41(mca_common_ompio_simple_grouping+0x8e) [0x2ad63f4938ce]
 4 /glade/u/apps/ch/opt/openmpi/4.1.1/nvhpc/21.11/lib/libmca_common_ompio.so.41(mca_common_ompio_set_view+0x937) [0x2ad63f49c737]
 5 /glade/u/apps/ch/opt/openmpi/4.1.1/nvhpc/21.11/lib/openmpi/mca_io_ompio.so(mca_io_ompio_file_set_view+0xc7) [0x2ad6404e6347]
 6 /glade/u/apps/ch/opt/openmpi/4.1.1/nvhpc/21.11/lib/libmpi.so.40(PMPI_File_set_view+0x1a4) [0x2ad612494324]
 7 /glade/u/apps/ch/opt/pio/2.5.5/openmpi/4.1.1/nvhpc/21.11/lib/libpioc.so.5(ncmpio_file_set_view+0x161) [0x2ad60aaea5a1]
 8 /glade/u/apps/ch/opt/pio/2.5.5/openmpi/4.1.1/nvhpc/21.11/lib/libpioc.so.5(+0x5479a2) [0x2ad60aac89a2]
 9 /glade/u/apps/ch/opt/pio/2.5.5/openmpi/4.1.1/nvhpc/21.11/lib/libpioc.so.5(+0x546529) [0x2ad60aac7529]
10 /glade/u/apps/ch/opt/pio/2.5.5/openmpi/4.1.1/nvhpc/21.11/lib/libpioc.so.5(+0x545386) [0x2ad60aac6386]
11 /glade/u/apps/ch/opt/pio/2.5.5/openmpi/4.1.1/nvhpc/21.11/lib/libpioc.so.5(+0x544a92) [0x2ad60aac5a92]
12 /glade/u/apps/ch/opt/pio/2.5.5/openmpi/4.1.1/nvhpc/21.11/lib/libpioc.so.5(ncmpio_wait+0x9f) [0x2ad60aac561f]
13 /glade/u/apps/ch/opt/pio/2.5.5/openmpi/4.1.1/nvhpc/21.11/lib/libpioc.so.5(ncmpio_get_varn+0x9f) [0x2ad60aac48df]
14 /glade/u/apps/ch/opt/pio/2.5.5/openmpi/4.1.1/nvhpc/21.11/lib/libpioc.so.5(ncmpi_get_varn_all+0x2d7) [0x2ad60aa5b357]
15 /glade/u/apps/ch/opt/pio/2.5.5/openmpi/4.1.1/nvhpc/21.11/lib/libpioc.so.5(pio_read_darray_nc+0x436) [0x2ad60a63c2f6]
16 /glade/u/apps/ch/opt/pio/2.5.5/openmpi/4.1.1/nvhpc/21.11/lib/libpioc.so.5(PIOc_read_darray+0x331) [0x2ad60a619171]
17 /glade/u/apps/ch/opt/pio/2.5.5/openmpi/4.1.1/nvhpc/21.11/lib/libpiof.so.4(piodarray_read_darray_internal_double_+0x1b) [0x2ad60a355f9b]
18 /glade/u/apps/ch/opt/pio/2.5.5/openmpi/4.1.1/nvhpc/21.11/lib/libpiof.so.4(piodarray_read_darray_1d_double_+0x94) [0x2ad60a353514]
19 /glade/scratch/erik/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.20220428_162424_fumxti/bld/cesm.exe() [0x164c232]
```
Using cesm2_3_alpha08d I'm able to get the build to work, and there are a few nvhpc tests now in CESM. The CESM version uses an updated version of ccs_config, so this issue is already fixed there. I had talked to CISL about adding a new install, but they realized there was already a setup for a newer version of nvhpc, and since that is working they don't need to do that either. So I'm closing this as well.
I see the following issue with case.setup when I try to run cheyenne_nvhpc.
The specific test case is:
SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop
using ctsm5.1.dev091, which has ccs_config_cesm0.0.15.
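For reference, a sketch of how such a test is typically launched through CIME's create_test script (the relative path to cime/scripts depends on the checkout):

```sh
cd cime/scripts
./create_test SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop
```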
It looks like the fix is pretty simple, and I can make a PR for it.