ESCOMP / mizuRoute

Reach-based river routing model
http://escomp.github.io/mizuRoute/
GNU General Public License v3.0

PIO error when using gnu (> v10.1.0) and MPT #359

Open · nmizukami opened this issue 1 year ago

nmizukami commented 1 year ago

When using the gnu compiler with MPT, PIO sync fails (seemingly at random) with a segmentation fault (invalid memory reference).

Using the intel compiler with MPT works fine.
Using gnu with openmpi appears to work fine.
The error occurs in mizuRoute with large, high-resolution river network data (MERIT-Hydro).

I have been running into this problem for a long time (several years now).

The specific configuration is: gnu v12.1.0, netcdf v4.8.1, pnetcdf v1.12.3, mpt v2.25.

The traceback looks like this (run in debug mode with the flags -g -Wall -fmax-errors=0 -fbacktrace -fcheck=all). Frames 14 through 24 are not resolved; they are in the C code.

piolib_mod.F90 Line 1372 is just PIOc_sync(file%fh)

#13  0x2b9d2f8c8f66 in PMPI_File_write_at_all
    at /usr/src/packages/BUILD/mpt/lib/libmpi/src/romio/mpi-io/write_atall.c:61
#14  0xc53728 in ???
#15  0xc3ae8f in ???
#16  0xc38984 in ???
#17  0xc3a4f2 in ???
#18  0xc369ce in ???
#19  0xc37203 in ???
#20  0xb99763 in ???
#21  0x7b8fc1 in ???
#22  0x7b365e in ???
#23  0x7b917b in ???
#24  0x78559b in ???
#25  0x7077a9 in __piolib_mod_MOD_syncfile
    at /glade/u/home/mizukami/sandbox_mizuRoute/libraries/parallelio/src/flib/piolib_mod.F90:1372
#26  0x4193f2 in __pio_utils_MOD_sync_file
    at /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/pio_utils.f90:391
#27  0x46dcc8 in __historyfile_MOD_write_flux
    at /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/historyFile.f90:483
#28  0x58a35e in __write_simoutput_pio_MOD_output
    at /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/write_simoutput_pio.f90:224
#29  0x7042d8 in route_runoff
    at /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/standalone/route_runoff.f90:81
#30  0x7043f7 in main
    at /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/standalone/route_runoff.f90:11
MPT ERROR: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
    aborting job
MPT: Received signal 11
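
For context, the call that fails (piolib_mod.F90:1372, frame #25 above) is just the PIO file sync. A minimal illustrative sketch of a call site like pio_utils::sync_file is shown below; the subroutine and argument names are placeholders rather than the actual mizuRoute source, while file_desc_t and PIO_syncfile are the real PIO Fortran API (PIO_syncfile wraps the C routine PIOc_sync(file%fh), which is where the traceback enters the unresolved C frames).

subroutine sync_file(pioFileDesc, ierr, message)
  ! Illustrative sketch only. PIO_syncfile is a collective call that
  ! flushes buffered darray writes to disk on all I/O tasks; the
  ! segfault under gnu+MPT surfaces inside this flush.
  use pio, only: file_desc_t, PIO_syncfile
  implicit none
  type(file_desc_t), intent(inout) :: pioFileDesc  ! open history file handle
  integer,           intent(out)   :: ierr
  character(*),      intent(out)   :: message
  ierr = 0
  message = 'sync_file/'
  call PIO_syncfile(pioFileDesc)   ! -> PIOc_sync(file%fh) in the C library
end subroutine sync_file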
ekluzek commented 1 year ago

I do have some GNU tests that work in the latest...

ERI_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_gnu.mizuroute-default
SMS_D_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_gnu.mizuroute-default
ERI_PS.f19_f19_rHDMAlk_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
RS_PS.f19_f19_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
ERS_PS.f19_f19_rHDMAlk_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
ERS_PS.nldas2_nldas2_rHDMA_mnldas2.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
ERS_PS.nldas2_nldas2_rUSGS_mnldas2.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
PET_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
PET_P215x8.nldas2_nldas2_rHDMA_mnldas2.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
PFS.f19_f19_rHDMA_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS.f09_f09_rHDMAlk_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS_D.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS_D_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS_Mmpi-serial_D_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS_P720x4.nldas2_nldas2_rMERIT_mnldas2.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default

But it also seems that this requires running for at least 10 model years before it shows up.

This has:

gnu/10.1.0 mpt/2.25 netcdf-mpi/4.9.0 pnetcdf/1.12.3

nmizukami commented 9 months ago

More updates. @ekluzek, do you think this is enough information for someone to determine the root cause of the error?

This is a test on Derecho with gcc and cray-mpich. The modules loaded for compilation and runs are:

 1) ncarenv/23.09 (S)   2) cmake/3.26.3   3) nccmp/1.9.1.0   4) ncview/2.1.9   5) conda/latest   6) cdo/2.2.2   7) nco/5.1.6   8) gcc/12.2.0   9) hdf5/1.12.2  10) netcdf/4.9.2  11) ncarcompilers/1.0.0  12) craype/2.7.23  13) cray-mpich/8.1.27  14) parallel-netcdf/1.12.3

Note that intel/cray-mpich and gcc/openmpi 5.0.0 both work fine.

The run died after several time iterations at the PIO sync call. Using DDT, I was able to trace back to the PIO function where it stopped.

#29 route_runoff () at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/standalone/route_runoff.f90:81 (at 0x6e187a)
#28 write_simoutput_pio::output (ierr=0, message=<uninitialized buffer contents elided>, _message=256) at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/write_simoutput_pio.f90:218 (at 0x5881bc)
#27 historyfile::sync (this=(...), ierr=0, message=<uninitialized buffer contents elided>, _message=256) at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/historyFile.f90:354 (at 0x47c572)
#26 pio_utils::sync_file (piofiledesc=(...), ierr=0, message=<uninitialized buffer contents elided>, _message=256) at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/pio_utils.f90:409 (at 0x43578e)
#25 piolib_mod::syncfile (file=(...)) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/flib/piolib_mod.F90:1470 (at 0x6e5e5a)
#24 PIOc_sync (ncid=129) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_file.c:422 (at 0x76f51a)
#23 flush_buffer (ncid=129, wmb=0x1871f970, flushtodisk=true) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_darray_int.c:1820 (at 0x7a9af0)
#22 PIOc_write_darray_multi (ncid=129, varids=0x1b5a8020, ioid=512, nvars=5, arraylen=42191, array=0x125066f0, frame=0x19175c40, fillvalue=0x0, flushtodisk=true) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_darray.c:420 (at 0x7a3b94)
#21 flush_output_buffer (file=0x190c47d0, force=true, addsize=0) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_darray_int.c:1765 (at 0x7a995a)
#20 ncmpi_wait_all () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x15425120f3cc)
#19 ncmpio_wait () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c1f9b)
#18 req_commit () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c1751)
#17 wait_getput () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c534c)
#16 req_aggregation () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c3781)
#15 mgetput () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c5d1a)
#14 ncmpio_read_write () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512cb319)
#13 PMPI_File_write_at_all () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500c9791)
#12 MPIOI_File_write_all () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500c7e59)
#11 ADIOI_GPFS_WriteStridedColl () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500d6216)
#10 ADIOI_GPFS_Calc_others_req () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500cede3)
#9 PMPI_Alltoallv () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424db2e1ea)
#8 MPIR_Alltoallv_impl () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424db2d1f8)
#7 MPIR_Alltoallv_intra_auto () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424db2d096)
#6 MPIR_Alltoallv_intra_scattered () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424f5b8b82)
#5 MPIC_Waitall () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424f6db226)
#4 MPIR_Waitall () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424e97a22f)
#3 MPIR_Waitall_impl () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424e911dc1)
#2 MPIDI_SHMI_progress () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424ff0092f)
#1 MPIR_Cray_Memcpy_wrapper () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424ff3aea4)
#0 _cray_mpi_memcpy_rome () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500a5f50)

[Screenshot: DDT session, 2024-01-25 10:33 AM]

nmizukami commented 5 months ago

Hi @ekluzek, I heard about some pnetcdf issues in CESM I/O during the CESM workshop (I believe in the CSEG working group AND the ultra-high resolution modeling session). Coincidentally, I noticed that the output error in mizuRoute happens with PIO built with pnetcdf support. When PIO is built without pnetcdf (using only netCDF), mizuRoute PIO output is stable. Note that this happens only for PIO built with gnu and cray-mpich.
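
As an aside, the pnetcdf write path can also be avoided at runtime by selecting the serial netCDF iotype when a file is created. Below is a hypothetical sketch using the standard PIO Fortran API (PIO_createfile, file_desc_t, iosystem_desc_t, PIO_clobber, and the PIO_iotype_* constants are real PIO names; the subroutine, its arguments, and the file name are placeholders):

subroutine create_history_netcdf(pioSystem, pioFile, ierr)
  ! Hypothetical workaround sketch: open the output file with the serial
  ! netCDF iotype so the pnetcdf code path is never exercised.
  use pio, only: iosystem_desc_t, file_desc_t, PIO_createfile, &
                 PIO_iotype_netcdf, PIO_clobber
  implicit none
  type(iosystem_desc_t), intent(inout) :: pioSystem  ! assumed initialized via PIO_init
  type(file_desc_t),     intent(out)   :: pioFile
  integer,               intent(out)   :: ierr
  ! PIO_iotype_pnetcdf would route writes through pnetcdf/MPI-IO;
  ! PIO_iotype_netcdf writes through netCDF on the I/O task(s) instead.
  ierr = PIO_createfile(pioSystem, pioFile, PIO_iotype_netcdf, &
                        'history.nc', PIO_clobber)
end subroutine create_history_netcdf

This sidesteps the crash at the cost of parallel write performance, so it is a diagnostic workaround rather than a fix.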

ekluzek commented 5 months ago

@nmizukami in looking at both the ParallelIO and pnetcdf GitHub pages, I don't see an issue that might explain this.

Can you figure out which talks mentioned this? Then we could watch the video and find where they talk about it; that might give more context on where this is being discussed.

nmizukami commented 5 months ago

Hi Erik, a few talks that briefly mentioned the pnetcdf issue are in the day 2 ultra-high resolution session:

SIMA talk: slide 14, or around 08:48:00 in the YouTube recording

Earthwork talk: slide 8