Open nmizukami opened 1 year ago
I do have some GNU tests that work in the latest...
ERI_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_gnu.mizuroute-default
SMS_D_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_gnu.mizuroute-default
ERI_PS.f19_f19_rHDMAlk_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
RS_PS.f19_f19_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
ERS_PS.f19_f19_rHDMAlk_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
ERS_PS.nldas2_nldas2_rHDMA_mnldas2.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
ERS_PS.nldas2_nldas2_rUSGS_mnldas2.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
PET_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
PET_P215x8.nldas2_nldas2_rHDMA_mnldas2.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
PFS.f19_f19_rHDMA_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS.f09_f09_rHDMAlk_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS_D.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS_D_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS_Mmpi-serial_D_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS_P720x4.nldas2_nldas2_rMERIT_mnldas2.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
But, it also seems that this requires running for at least 10 years before it shows up.
This has:
gnu/10.1.0 mpt/2.25 netcdf-mpi/4.9.0 pnetcdf/1.12.3
More updates. @ekluzek, do you think this is enough information for someone to identify the root cause of the error?
This is a test based on derecho with gcc and cray-mpich. The modules loaded for compilation and runs are:
 1) ncarenv/23.09 (S)
 2) cmake/3.26.3
 3) nccmp/1.9.1.0
 4) ncview/2.1.9
 5) conda/latest
 6) cdo/2.2.2
 7) nco/5.1.6
 8) gcc/12.2.0
 9) hdf5/1.12.2
10) netcdf/4.9.2
11) ncarcompilers/1.0.0
12) craype/2.7.23
13) cray-mpich/8.1.27
14) parallel-netcdf/1.12.3
Note that intel/cray-mpich and gcc/openmpi5.0.0 both work fine.
The run died after several time iterations at the pio_synch call. Using DDT, I was able to trace it back to the PIO function where it stopped.
#29 route_runoff () at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/standalone/route_runoff.f90:81 (at 0x6e187a)
#28 write_simoutput_pio::output (ierr=0, message='o\000\000\000[...]'..., _message=256) at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/write_simoutput_pio.f90:218 (at 0x5881bc)
#27 historyfile::sync (this=(...), ierr=0, message='s\000\000\000[...]'..., _message=256) at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/historyFile.f90:354 (at 0x47c572)
#26 pio_utils::sync_file (piofiledesc=(...), ierr=0, message='s\000\000\000[...]'..., _message=256) at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/pio_utils.f90:409 (at 0x43578e)
#25 piolib_mod::syncfile (file=(...)) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/flib/piolib_mod.F90:1470 (at 0x6e5e5a)
#24 PIOc_sync (ncid=129) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_file.c:422 (at 0x76f51a)
#23 flush_buffer (ncid=129, wmb=0x1871f970, flushtodisk=true) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_darray_int.c:1820 (at 0x7a9af0)
#22 PIOc_write_darray_multi (ncid=129, varids=0x1b5a8020, ioid=512, nvars=5, arraylen=42191, array=0x125066f0, frame=0x19175c40, fillvalue=0x0, flushtodisk=true) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_darray.c:420 (at 0x7a3b94)
#21 flush_output_buffer (file=0x190c47d0, force=true, addsize=0) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_darray_int.c:1765 (at 0x7a995a)
#20 ncmpi_wait_all () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x15425120f3cc)
#19 ncmpio_wait () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c1f9b)
#18 req_commit () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c1751)
#17 wait_getput () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c534c)
#16 req_aggregation () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c3781)
#15 mgetput () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c5d1a)
#14 ncmpio_read_write () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512cb319)
#13 PMPI_File_write_at_all () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500c9791)
#12 MPIOI_File_write_all () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500c7e59)
#11 ADIOI_GPFS_WriteStridedColl () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500d6216)
#10 ADIOI_GPFS_Calc_others_req () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500cede3)
#9 PMPI_Alltoallv () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424db2e1ea)
#8 MPIR_Alltoallv_impl () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424db2d1f8)
#7 MPIR_Alltoallv_intra_auto () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424db2d096)
#6 MPIR_Alltoallv_intra_scattered () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424f5b8b82)
#5 MPIC_Waitall () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424f6db226)
#4 MPIR_Waitall () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424e97a22f)
#3 MPIR_Waitall_impl () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424e911dc1)
#2 MPIDI_SHMI_progress () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424ff0092f)
#1 MPIR_Cray_Memcpy_wrapper () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424ff3aea4)
#0 _cray_mpi_memcpy_rome () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500a5f50)
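To make the call chain in the trace above easier to read, the frames can be grouped by the software layer they belong to (mizuRoute, ParallelIO, PnetCDF, Cray MPICH). The following is only a minimal sketch: the `LAYERS` path fragments, the `classify` and `summarize` helpers, and the shortened `SAMPLE` are illustrative assumptions, not part of mizuRoute, PIO, or DDT, and the fragments would need adjusting for another system.

```python
import re

# Hypothetical path fragments used to label each layer (assumption: they match
# the install paths shown in the trace from this issue; adjust elsewhere).
LAYERS = [
    ("Cray MPICH", "libmpi_gnu"),
    ("PnetCDF", "libpnetcdf"),
    ("ParallelIO", "libraries/parallelio"),
    ("mizuRoute", "mizuRoute/route"),
]

# A frame line looks like "#24 PIOc_sync (ncid=129) at /path/file.c:422 (...)".
FRAME_RE = re.compile(r"^#(\d+)\s+(\S+)\s*\(")


def classify(line):
    """Return (frame_number, function, layer), or None if not a frame line."""
    match = FRAME_RE.match(line)
    if match is None:
        return None
    number, func = int(match.group(1)), match.group(2)
    for layer, fragment in LAYERS:
        if fragment in line:
            return number, func, layer
    return number, func, "unknown"


def summarize(trace):
    """Group the frames of a backtrace by layer, keeping the original order."""
    summary = {}
    for raw in trace.splitlines():
        parsed = classify(raw.strip())
        if parsed is None:
            continue
        number, func, layer = parsed
        summary.setdefault(layer, []).append((number, func))
    return summary


# A few frames copied from the trace in this issue.
SAMPLE = """\
#29 route_runoff () at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/standalone/route_runoff.f90:81 (at 0x6e187a)
#24 PIOc_sync (ncid=129) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_file.c:422 (at 0x76f51a)
#20 ncmpi_wait_all () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x15425120f3cc)
#9 PMPI_Alltoallv () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424db2e1ea)
#0 _cray_mpi_memcpy_rome () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500a5f50)
"""

if __name__ == "__main__":
    for layer, frames in summarize(SAMPLE).items():
        print(layer, frames)
```

Read this way, the trace shows the mizuRoute sync call entering ParallelIO, which hands the flush to PnetCDF, which issues a collective MPI-IO write; the innermost frame (#0) is inside the Cray MPICH library.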
Hi @ekluzek, I heard about some issues with pnetcdf in CESM I/O during the CESM workshop (I believe at the CSEG working group and at the ultra-high-resolution modeling session). Coincidentally, I noticed that the output error in mizuRoute happens with PIO built with pnetcdf support. When PIO is built without pnetcdf (using netcdf only), mizuRoute PIO output is stable. Note that this happens only for PIO built with gnu and cray-mpich.
@nmizukami, looking at both the ParallelIO and pnetcdf GitHub pages, I don't see an issue that might explain this.
Can you figure out which talks discussed this? Then we could watch the videos and find where they talk about it. That might give us more context on where this is being discussed.
When using the gnu compiler with MPT, PIO sync fails (seemingly randomly) with a segmentation fault (invalid memory reference).
Using intel compiler with MPT works fine.
Using gnu with openmpi seems to work fine.
This error happens in mizuRoute with large, high-resolution river network data (MERIT-Hydro).
I have been running into this problem for a long time (several years now).
The more specific configuration is: gnu v12.1.0, netcdf v4.8.1, pnetcdf v1.12.3, mpt v2.25
The traceback looks like this (run in debug mode; the flags are
-g -Wall -fmax-errors=0 -fbacktrace -fcheck=all
). Frames 14 through 25 are not displayed: they would be in C code. piolib_mod.F90 line 1372 is just
PIOc_sync(file%fh)