ESCOMP / mizuRoute

Reach-based river routing model
http://escomp.github.io/mizuRoute/
GNU General Public License v3.0

Transition to intel-oneapi from intel on Derecho #463

Closed: ekluzek closed this issue 2 months ago

ekluzek commented 2 months ago

We need to transition the mizuRoute testlist on Derecho from intel to intel-oneapi.

See this CTSM issue for more details: https://github.com/ESCOMP/CTSM/issues/2476

nmizukami commented 2 months ago

Tested standalone with these modules:

1) ncarenv/23.09 (S)   2) craype/2.7.23   3) cmake/3.26.3   4) intel-oneapi/2024.0.2   5) hdf5/1.14.3   6) netcdf/4.9.2   7) cray-mpich/8.1.27   8) parallel-netcdf/1.12.3   9) ncarcompilers/1.0.0

Compiled under sandbox_mizuRoute/route/build/:

gmake FC=intel FC_EXE=mpif90 F_MASTER=$BLDDIR NCDF_PATH=$NETCDF PNETCDF_PATH=$PNETCDF MODE=fast EXE=test

Then ran a test case under /glade/work/mizukami/test_mizuRoute/HDMA_global:

./test settings/HDMA_CLM5-runoff.control

forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source             
libpthread-2.31.s  00007FEFAE3EF8C0  Unknown               Unknown  Unknown
libhdf5.so.310.3.  00007FEFA800C654  H5T__init_native_     Unknown  Unknown
libhdf5.so.310.3.  00007FEFA7F3FE96  H5T_init              Unknown  Unknown
libhdf5.so.310.3.  00007FEFA802A679  H5VL_init_phase2      Unknown  Unknown
libhdf5.so.310.3.  00007FEFA7D26141  H5_init_library       Unknown  Unknown
libhdf5.so.310.3.  00007FEFA7DC348C  H5Eset_auto2          Unknown  Unknown
libnetcdf.so.19    00007FEFAF6A1F6C  nc4_hdf5_initiali     Unknown  Unknown
libnetcdf.so.19    00007FEFAF6AA497  NC_HDF5_initializ     Unknown  Unknown
libnetcdf.so.19    00007FEFAF62F428  nc_initialize         Unknown  Unknown
libnetcdf.so.19    00007FEFAF6323C6  NC_open               Unknown  Unknown
libnetcdf.so.19    00007FEFAF6322B4  nc_open               Unknown  Unknown
libnetcdff.so.7    00007FEFAFA49511  nf_open_              Unknown  Unknown
libnetcdff.so.7    00007FEFAFB0F6EB  Unknown               Unknown  Unknown
libnetcdff.so.7    00007FEFAFAA2725  netcdf_mp_nf90_op     Unknown  Unknown
test               0000000000417DC3  Unknown               Unknown  Unknown
test               00000000005072BB  Unknown               Unknown  Unknown
test               0000000000506B43  Unknown               Unknown  Unknown
test               00000000005067BB  Unknown               Unknown  Unknown
test               0000000000512078  Unknown               Unknown  Unknown
test               000000000041264D  Unknown               Unknown  Unknown
libc-2.31.so       00007FEFAA03E29D  __libc_start_main     Unknown  Unknown
test               000000000041257A  Unknown               Unknown  Unknown
Aborted (core dumped)

The error occurs when the code tries to open the river input netCDF file.

Compiling in debug mode produces even less clear output. Maybe some compilation flag is incorrect?

Uninitialized bytes in strlen at offset 0 inside [0x7010000003a0, 1)
==43662==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x7fb62149ae2b in MPIDI_CRAY_collopt_process_env (/opt/cray/pe/mpich/8.1.27/ofi/intel/2022.1/lib/libmpi_intel.so.12+0x1dc5e2b) (BuildId: 9050a3fd8814e8f4645b0e5108ad020e92954f4a)
    #1 0x7fb62149b45c in MPIDI_Cray_coll_init (/opt/cray/pe/mpich/8.1.27/ofi/intel/2022.1/lib/libmpi_intel.so.12+0x1dc645c) (BuildId: 9050a3fd8814e8f4645b0e5108ad020e92954f4a)
    #2 0x7fb6217f1de4 in MPID_Init (/opt/cray/pe/mpich/8.1.27/ofi/intel/2022.1/lib/libmpi_intel.so.12+0x211cde4) (BuildId: 9050a3fd8814e8f4645b0e5108ad020e92954f4a)
    #3 0x7fb61fe1ed84 in MPIR_Init_thread (/opt/cray/pe/mpich/8.1.27/ofi/intel/2022.1/lib/libmpi_intel.so.12+0x749d84) (BuildId: 9050a3fd8814e8f4645b0e5108ad020e92954f4a)
    #4 0x7fb61fe1eb53 in MPI_Init (/opt/cray/pe/mpich/8.1.27/ofi/intel/2022.1/lib/libmpi_intel.so.12+0x749b53) (BuildId: 9050a3fd8814e8f4645b0e5108ad020e92954f4a)
    #5 0x7fb6224995de in pmpi_init__ (/opt/cray/pe/mpich/8.1.27/ofi/intel/2022.1/lib/libmpifort_intel.so.12+0x4d5de) (BuildId: 63521a851ceb7a35393a775072a346557973adee)
    #6 0x89c05b in mpi_utils_mp_shr_mpi_init_ /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/mpi_utils.f90:919:10
    #7 0x18db479 in model_setup_mp_init_mpi_ /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/standalone/model_setup.f90:37:8
    #8 0x1989315 in MAIN__ /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/standalone/route_runoff.f90:57:6
    #9 0x418d38 in main (/glade/u/home/mizukami/sandbox_mizuRoute/route/bin/test+0x418d38) (BuildId: 4c39516a27ef6b37be82ead224660dbb57c7fd59)
    #10 0x7fb61e35829c in __libc_start_main (/lib64/libc.so.6+0x3529c) (BuildId: c8417d767baccfadb39b474e484d46947915cd8f)
    #11 0x418c19 in _start /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120

  Uninitialized value was created by a heap allocation
    #0 0x426616 in malloc (/glade/u/home/mizukami/sandbox_mizuRoute/route/bin/test+0x426616) (BuildId: 4c39516a27ef6b37be82ead224660dbb57c7fd59)
    #1 0x7fb62149a0e4 in MPIDI_CRAY_collopt_process_env (/opt/cray/pe/mpich/8.1.27/ofi/intel/2022.1/lib/libmpi_intel.so.12+0x1dc50e4) (BuildId: 9050a3fd8814e8f4645b0e5108ad020e92954f4a)

SUMMARY: MemorySanitizer: use-of-uninitialized-value (/opt/cray/pe/mpich/8.1.27/ofi/intel/2022.1/lib/libmpi_intel.so.12+0x1dc5e2b) (BuildId: 9050a3fd8814e8f4645b0e5108ad020e92954f4a) in MPIDI_CRAY_collopt_process_env
Exiting

nmizukami commented 2 months ago

When I use intel-oneapi/2023.2.1, which is now the default on Derecho, the code does not compile. The compilation error is below. I don't see what is wrong with the code.

          #0 0x00000000021c51e2
          #1 0x0000000002228e97
          #2 0x0000000002228e66
          #3 0x00000000022b34bd
          #4 0x0000000002299cd3
          #5 0x00000000022b4a57
          #6 0x00000000022a844c
          #7 0x00000000022a8060
          #8 0x00000000022c92eb
          #9 0x00000000022c6602
         #10 0x00000000022c5d4b
         #11 0x0000000002278113
         #12 0x000000000226ee01
         #13 0x000000000226cf83
         #14 0x0000000002277705
         #15 0x0000000002277ce4
         #16 0x00000000021fec79
         #17 0x00000000021fe8a0
         #18 0x00000000021fea4d
         #19 0x00000000021ff14c
         #20 0x0000000002277705
         #21 0x0000000002277ce4
         #22 0x00000000021fec79
         #23 0x00000000021fe8a0
         #24 0x00000000021fea4d
         #25 0x00000000021ff14c
         #26 0x0000000002277705
         #27 0x0000000002277ce4
         #28 0x0000000002274ce6
         #29 0x0000000002277705
         #30 0x0000000002277ce4
         #31 0x000000000227a159
         #32 0x0000000002277705
         #33 0x0000000002277ce4
         #34 0x00000000022752b2
         #35 0x0000000002277705
         #36 0x000000000227495a
         #37 0x0000000002277705
         #38 0x0000000002111c05
         #39 0x00000000021115bd
         #40 0x00000000022e13ce
         #41 0x00007fd08610329d __libc_start_main + 239
         #42 0x0000000001f51aa9

/glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/csv_data.f90(321): error #5623: **Internal compiler error: internal abort** Please report this error along with the circumstances in which it occurred in a Software Problem Report.  Note: File and line given may not be explicit cause of this error.
          csv_data(i,j) = this%csv_data(i,j)%str
----------^
compilation aborted for /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/csv_data.f90 (code 3)

nmizukami commented 2 months ago

It looks like the code can be compiled with intel-oneapi/2023.2.1. These modules are loaded for compiling and running the executable:

module load intel-oneapi
module load cray-mpich
module load craype
module load ncarcompilers
module load netcdf
module load parallel-netcdf

This compiler version does not like the do concurrent loops in [csv_data.f90](https://github.com/ESCOMP/mizuRoute/blob/a9da911a8d9e88ddc7e3713bd451d2c13cc1b173/route/build/src/csv_data.f90#L319), which causes the compilation error I posted above. If I change them to regular do loops, the code compiles. I am not sure whether this is a compiler bug.
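
For illustration, here is a minimal standalone sketch of the kind of change described above. It is not the actual csv_data.f90 code: the type `cell_t`, the array names, and the dimensions are hypothetical stand-ins, and only the assignment pattern (copying a `%str` component into a character array) mirrors the line flagged by the compiler. The sketch is not claimed to reproduce the internal compiler error; it only shows the two loop forms side by side and checks that they give the same result.

```fortran
! Minimal sketch with hypothetical names: copy the string component of a 2-D
! array of derived-type cells into a plain character array, first with
! DO CONCURRENT and then with ordinary nested DO loops.
program do_concurrent_workaround
  implicit none

  ! Hypothetical stand-in for the cell type used in csv_data.f90
  type :: cell_t
    character(len=:), allocatable :: str
  end type cell_t

  integer, parameter :: n_rows = 2, n_cols = 3
  type(cell_t)       :: cells(n_rows, n_cols)
  character(len=8)   :: out_concurrent(n_rows, n_cols)
  character(len=8)   :: out_plain(n_rows, n_cols)
  integer            :: i, j

  ! Fill the cells with recognizable labels such as "r1c2"
  do j = 1, n_cols
    do i = 1, n_rows
      cells(i,j)%str = 'r' // achar(iachar('0') + i) // 'c' // achar(iachar('0') + j)
    end do
  end do

  ! Loop form similar to the one flagged in the compilation error
  do concurrent (j = 1:n_cols, i = 1:n_rows)
    out_concurrent(i,j) = cells(i,j)%str
  end do

  ! Workaround: equivalent ordinary nested loops
  do j = 1, n_cols
    do i = 1, n_rows
      out_plain(i,j) = cells(i,j)%str
    end do
  end do

  ! Both forms should fill the output arrays identically
  if (any(out_concurrent /= out_plain)) error stop 'loop variants disagree'
  print '(a)', 'both loop forms produced the same array'
end program do_concurrent_workaround
```

Because each iteration writes a distinct element and reads only its own cell, the two forms are interchangeable here, so switching to regular do loops should not change results.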

If I use intel-oneapi/2024.0.2, I am not able to link netCDF correctly. The code compiles, but it fails at runtime (it cannot open the netCDF file).