NCAR / ParallelIO

A high-level Parallel I/O Library for structured grid applications
Apache License 2.0

PIO2 error with some iotasks #1986

Open apcraig opened 6 months ago

apcraig commented 6 months ago

I get the following error on Derecho with PIO2 when the total number of MPI tasks is a little short of iotasks times stride. My test case has 8 iotasks and a stride of 4. If the root iotask is 0, the iotasks are MPI tasks 0,4,8,12,16,20,24,28. If the root iotask is 1, the iotasks are MPI tasks 1,5,9,13,17,21,25,29. The former should run fine with as few as 29 total tasks, the latter with 30. Both run fine with 32 tasks, but both fail (as below) with 31 or 30 tasks. The error occurs for all format types (cdf1, cdf2, cdf5, hdf5), with both netcdf and pnetcdf, and with all compilers (i.e. intel, gnu, cray). PIO1 works fine.
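The rank arithmetic in the report can be checked with a short sketch. It assumes the iotasks sit at ranks root, root+stride, root+2*stride, and so on, exactly as described above; it is an illustration in Python, not code taken from PIO or CICE, and the function names are hypothetical.

```python
def iotask_ranks(root, stride, iotasks):
    """MPI ranks that act as I/O tasks: root, root+stride, root+2*stride, ..."""
    return [root + i * stride for i in range(iotasks)]


def min_total_tasks(root, stride, iotasks):
    """Smallest total task count that still contains the highest I/O rank."""
    return iotask_ranks(root, stride, iotasks)[-1] + 1


if __name__ == "__main__":
    # The two layouts from the report: 8 iotasks, stride 4, root 0 or 1.
    for root in (0, 1):
        ranks = iotask_ranks(root, stride=4, iotasks=8)
        needed = min_total_tasks(root, stride=4, iotasks=8)
        print(f"root={root}: ranks={ranks}, needs at least {needed} tasks")
```

By this arithmetic, runs with 30 or 31 tasks contain every requested iotask rank for both roots, which is why the failures are unexpected.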

Testing was done on Derecho in February 2024 with CICE, using

module load parallelio/1.10.1

or

module load parallelio/2.6.1

The error looks like

Obtained 10 stack frames.
/glade/u/apps/derecho/23.06/spack/opt/spack/parallelio/2.6.1/cray-mpich/8.1.25/oneapi/2023.0.0/jxom/lib/libpioc.so(pio_err+0x80) [0x149094d7c180]
/glade/u/apps/derecho/23.06/spack/opt/spack/parallelio/2.6.1/cray-mpich/8.1.25/oneapi/2023.0.0/jxom/lib/libpioc.so(PIOc_Init_Intracomm+0xc9) [0x149094d84cf9]
/glade/u/apps/derecho/23.06/spack/opt/spack/parallelio/2.6.1/cray-mpich/8.1.25/oneapi/2023.0.0/jxom/lib/libpioc.so(PIOc_Init_Intracomm_from_F90+0x14) [0x149094d851d4]
/glade/u/apps/derecho/23.06/spack/opt/spack/parallelio/2.6.1/cray-mpich/8.1.25/oneapi/2023.0.0/jxom/lib/libpiof.so(piolib_mod_mp_initintracom+0xd6) [0x149094fcb746]
/var/run/palsd/3d07ccb2-c26d-421d-8a14-427ee69dfcbc/files/cice() [0x13e82ef]
MPICH ERROR [Rank 0] [job id 3d07ccb2-c26d-421d-8a14-427ee69dfcbc] [Tue Feb 20 11:53:34 2024] [dec2097] - Abort(-1) (rank 0 in comm 0): application called MPI_Abort( MPI_COMM_WORLD, -1) - process 0

jedwards4b commented 6 months ago

parallelio/2.6.2 is the latest - can you try with that?

apcraig commented 6 months ago

Confirmed: same error with parallelio/2.6.2.

Obtained 10 stack frames.
/glade/u/apps/derecho/23.06/spack/opt/spack/parallelio/2.6.2/cray-mpich/8.1.25/oneapi/2023.0.0/qn7u/lib/libpioc.so(pio_err+0x80) [0x14dc95f91f10]
/glade/u/apps/derecho/23.06/spack/opt/spack/parallelio/2.6.2/cray-mpich/8.1.25/oneapi/2023.0.0/qn7u/lib/libpioc.so(PIOc_Init_Intracomm+0xc9) [0x14dc95f9aa89]
/glade/u/apps/derecho/23.06/spack/opt/spack/parallelio/2.6.2/cray-mpich/8.1.25/oneapi/2023.0.0/qn7u/lib/libpioc.so(PIOc_Init_Intracomm_from_F90+0x14) [0x14dc95f9af64]
/glade/u/apps/derecho/23.06/spack/opt/spack/parallelio/2.6.2/cray-mpich/8.1.25/oneapi/2023.0.0/qn7u/lib/libpiof.so(piolib_mod_mp_initintracom+0xd6) [0x14dc961e09d6]
/var/run/palsd/4a56ef74-20e1-4a2d-8210-61c95e00e74e/files/cice() [0x13e8340]

jedwards4b commented 6 months ago

Thank you - now can you provide instructions and source code so that I can reproduce the problem?

apcraig commented 6 months ago

We should have an updated version of CICE on main this week with all these new PIO control capabilities. Once the PR is merged, I'll provide a guide to reproduce the problem. More soon.

apcraig commented 6 months ago

To reproduce the problem with CICE, do the following:

git clone https://github.com/cice-consortium/cice cice.testpio
cd cice.testpio
git checkout aca835755aa82ead
git submodule update --init
Edit cicecore/cicedyn/infrastructure/io/io_pio2/ice_pio.F90 at about line 165 to turn off the limiter that applies when iotasks is user-defined:

@@ -165,8 +165,8 @@ subroutine ice_pio_init(mode, filename, File, clobber, fformat, &
    lroot = min(lroot,nprocs-1)   ! lroot <= nprocs-1
 !  Adjustments for PIO2 iotask issue, https://github.com/NCAR/ParallelIO/issues/1986
 !   liotasks = max(1,min(liotasks, (nprocs-lroot)/lstride))  ! very conservative
-   liotasks = max(1,min(liotasks,nprocs/lstride - lroot/lstride))  ! less conservative (note integer math)
-!   liotasks = max(1,min(liotasks, 1 + (nprocs-lroot-1)/lstride))  ! optimal
+!   liotasks = max(1,min(liotasks,nprocs/lstride - lroot/lstride))  ! less conservative (note integer math)
+   liotasks = max(1,min(liotasks, 1 + (nprocs-lroot-1)/lstride))  ! optimal

./cice.setup --case test1 -m derecho -e gnu -p 31x1 -s iopio2
cd test1
./cice.build
./*.submit
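The three limiter expressions in the diff above can be compared side by side with a small sketch. This is a Python transcription of the Fortran lines for illustration only, not part of CICE or PIO; the function names are hypothetical. It assumes all operands are non-negative (the preceding `lroot = min(lroot,nprocs-1)` line suggests this), so Python's `//` gives the same result as Fortran integer division.

```python
def very_conservative(liotasks, nprocs, lroot, lstride):
    # max(1,min(liotasks, (nprocs-lroot)/lstride))
    return max(1, min(liotasks, (nprocs - lroot) // lstride))


def less_conservative(liotasks, nprocs, lroot, lstride):
    # max(1,min(liotasks, nprocs/lstride - lroot/lstride))
    return max(1, min(liotasks, nprocs // lstride - lroot // lstride))


def optimal(liotasks, nprocs, lroot, lstride):
    # max(1,min(liotasks, 1 + (nprocs-lroot-1)/lstride))
    return max(1, min(liotasks, 1 + (nprocs - lroot - 1) // lstride))


if __name__ == "__main__":
    # Request 8 iotasks with root 1 and stride 4 for several task counts.
    for nprocs in (29, 30, 31, 32):
        args = (8, nprocs, 1, 4)
        print(nprocs, very_conservative(*args),
              less_conservative(*args), optimal(*args))
```

For a request of 8 iotasks with root 1 and stride 4 on 31 tasks, the first two expressions cap the count at 7, while the "optimal" expression allows all 8. That is the configuration the edit is meant to expose.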

This runs PIO with default settings and should run fine. In the logs/cice.runlog file, you'll see this:

 (ice_pio_init) nprocs     =           31
 (ice_pio_init) pio_iotype =            2
 (ice_pio_init) iotasks    =            7
 (ice_pio_init) baseroot   =            1
 (ice_pio_init) stride     =            4
 (ice_pio_init) nmode      =            0

So it's using 7 iotasks with a stride of 4 and a rootpe of 1. Now manually set the iotasks in ice_in:

    restart_iotasks = 8
    restart_root   = 1
    restart_stride = 4

    history_iotasks = 8
    history_root   = 1
    history_stride = 4

then

./*.submit

This will fail with the error shown above. If you want to experiment with the modules, they are set in the file env.derecho_gnu. You can run the same tests with the intel or cray compiler by changing "-e gnu" to "-e intel" or "-e cray" on the cice.setup command line.

Please feel free to email or comment here if there are any questions. Thanks!