E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

FATES PIO issue for `f19_g16` resolution `ERP` tests #6316

Open glemieux opened 4 months ago

glemieux commented 4 months ago

In the fates test list we have two debug-mode ERP tests using the f19_g16 resolution with the default set of fates run modes; the only difference between them is that one runs with the gnu compiler and the other with intel. Both tests fail with a PIO error while accessing the restart file:

 64: PIO: FATAL ERROR: Aborting... An error occured, Waiting on pending requests on file (./ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3.elm.r.0001-01-03-00000.nc, ncid=56) failed (Number of pending requests on file = 129, Number of variables with pending requests = 129, Number of request blocks = 2, Current block being waited on = 0, Number of requests in current block = 92).. Size of I/O request exceeds INT_MAX (err=-237). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/global/u1/g/glemieux/E3SM-project/e3sm/externals/scorpio/src/clib/pio_darray_int.c: 2087)
 64: Obtained 10 stack frames.
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a40be8]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a3fc95]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a823bf]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a73b7c]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a827d8]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a74cc3]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a303f4]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x4616652]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x467faf3]
 64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0xb81e4e]
 64: MPICH ERROR [Rank 64] [job id 23646208.0] [Fri Mar 29 12:37:52 2024] [nid006735] - Abort(-1) (rank 64 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 64
 64:
 64: aborting job:
 64: application called MPI_Abort(MPI_COMM_WORLD, -1) - process 64
srun: error: nid006735: task 64: Exited with exit code 255

We have other similar fates ERP tests that run on ne4pg2_ne4pg2 and f09_g16 that don't seem to hit this issue, although those are not run in debug mode.

jayeshkrishna commented 4 months ago

  • Is this issue occurring with the latest E3SM master?
  • Do you see this issue on other machines (apart from pm)?
  • Is the test using PnetCDF or NetCDF for writes (xmlquery for PIO_TYPENAME)?
  • How many MPI processes is the test using on PM?

glemieux commented 4 months ago

  • Is this issue occurring with the latest E3SM master?

Yes, nearly the latest master. This was discovered when generating new fates test list baselines using E3SM v3.0.0-104-g7792c63c19 (commit from 4 days ago) and fates tag sci.1.70.0_api.32.0.0_tools.1.1.0.

  • Do you see this issue on other machines (apart from pm)?

To be determined.

  • Is the test using PnetCDF or NetCDF for writes (xmlquery for PIO_TYPENAME)?

Looks like land is using PnetCDF:

PIO_TYPENAME: ['CPL:pnetcdf', 'ATM:netcdf', 'LND:pnetcdf', 'ICE:pnetcdf', 'OCN:pnetcdf', 'ROF:pnetcdf', 'GLC:pnetcdf', 'WAV:pnetcdf', 'IAC:pnetcdf', 'ESP:pnetcdf']

  • How many MPI processes is the test using on PM?

128 tasks. Here's the preview_run output:

CASE INFO:
  nodes: 1
  total tasks: 128
  tasks per node: 128
  thread count: 1
  ngpus per node: 0

BATCH INFO:
  FOR JOB: case.test
    ENV:
      Setting Environment ADIOS2_ROOT=/global/cfs/cdirs/e3sm/3rdparty/adios2/2.9.1/cray-mpich-8.1.25/gcc-11.2.0
      Setting Environment Albany_ROOT=/global/common/software/e3sm/mali_tpls/albany-e3sm-serial-release-gcc
      Setting Environment BLA_VENDOR=Generic
      Setting Environment FI_CXI_RX_MATCH_MODE=software
      Setting Environment GATOR_INITIAL_MB=4000MB
      Setting Environment HDF5_USE_FILE_LOCKING=FALSE
      Setting Environment MPICH_COLL_SYNC=MPI_Bcast
      Setting Environment MPICH_ENV_DISPLAY=1
      Setting Environment MPICH_VERSION_DISPLAY=1
      Setting Environment NETCDF_PATH=/opt/cray/pe/netcdf-hdf5parallel/4.9.0.3/gnu/9.1
      Setting Environment OMP_NUM_THREADS=1
      Setting Environment OMP_PLACES=threads
      Setting Environment OMP_PROC_BIND=spread
      Setting Environment OMP_STACKSIZE=128M
      Setting Environment PERL5LIB=/global/cfs/cdirs/e3sm/perl/lib/perl5-only-switch
      Setting Environment PNETCDF_PATH=/opt/cray/pe/parallel-netcdf/1.12.3.3/gnu/9.1
      Setting Environment Trilinos_ROOT=/global/common/software/e3sm/mali_tpls/trilinos-e3sm-serial-release-gcc

    SUBMIT CMD:
      sbatch --time 00:31:40 -q regular --account m2420 .case.test 

    MPIRUN (job=case.test):
      srun  --label  -n 128 -N 1 -c 2  --cpu_bind=cores   -m plane=128 /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_gnu.elm-fates_cold.G.20240329_093430_nc2qos/bld/e3sm.exe   >> e3sm.log.$LID 2>&1 

I've confirmed this fails in non-debug mode as well.

jayeshkrishna commented 4 months ago

Thanks, can you also print out the PIO_BUFFER_SIZE_LIMIT (./xmlquery PIO_BUFFER_SIZE_LIMIT) for the test?

We might be able to overcome this limit by increasing the number of I/O tasks too (setting PIO_NUMTASKS to say 8)
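For reference, a minimal sketch of those commands, run from the test's case directory:

# check the current buffer limit for the case
./xmlquery PIO_BUFFER_SIZE_LIMIT
# raise the number of dedicated I/O tasks
./xmlchange PIO_NUMTASKS=8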

jayeshkrishna commented 4 months ago

Try adding a testmod to set the number of I/O tasks for the test to 8 (./xmlchange PIO_NUMTASKS=8; ./xmlchange PIO_STRIDE=-99) and see if it works. For an example of how such testmods are laid out, see the one used by SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.anlgce_gnu.elm-force_netcdf_pio (./components/elm/cime_config/testdefs/testmods_dirs/elm/force_netcdf_pio).
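A minimal sketch of such a testmod's shell_commands file (the fates_cold_pio directory name is illustrative only, not an existing testmod):

#!/bin/bash
# e.g. components/elm/cime_config/testdefs/testmods_dirs/elm/fates_cold_pio/shell_commands
# Use fewer dedicated I/O tasks so that no single aggregated write
# request exceeds INT_MAX bytes.
./xmlchange PIO_NUMTASKS=8
# -99 lets the PIO stride be derived automatically from the task counts
./xmlchange PIO_STRIDE=-99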

glemieux commented 4 months ago

Thanks, can you also print out the PIO_BUFFER_SIZE_LIMIT (./xmlquery PIO_BUFFER_SIZE_LIMIT) for the test?

We might be able to overcome this limit by increasing the number of I/O tasks too (setting PIO_NUMTASKS to say 8)

PIO_BUFFER_SIZE_LIMIT: -1

glemieux commented 4 months ago

force_netcdf_pio

I'm sorry, I don't quite understand what you're suggesting here. Do you want me to modify the failing f19_g16 test to use the force_netcdf_pio testmod shell script, adding the ./xmlchange commands you noted to it as well?

jayeshkrishna commented 4 months ago

No, just add a testmod for the failing ERP test so that you can set PIO_NUMTASKS to 8 and PIO_STRIDE to -99 (I mentioned the *elm-force_netcdf_pio test as a reference for how to add/set testmods for CIME tests).

glemieux commented 4 months ago

:tada: That did the trick. The test passes using the PIO_NUMTASKS and PIO_STRIDE values you suggested above, @jayeshkrishna. What are the next steps for addressing this?

jayeshkrishna commented 4 months ago

Can you also check if PIO_NUMTASKS=4 works? The solution for this issue would be to set the number of I/O tasks (8 or 4) permanently in a testmod for this test (add the above xmlchange commands to the testmod associated with this test). The value should get reset by E3SM (share utils) if the test is run with fewer than 8/4 procs.

glemieux commented 4 months ago

PIO_NUMTASKS=4 works as well.

This particular testmod, fates_cold, is used pretty widely across a number of resolutions and is also the basis for other testmods. I can create a resolution-specific testmod for this one test, but I'm wondering whether there are other options for updating the PIO settings without having to tie a testmod to a given resolution.

jayeshkrishna commented 4 months ago

OK, I will try to recreate the issue and find a fix for it in SCORPIO. Meanwhile, you can add the testmod to get the test working on PM.

glemieux commented 4 months ago

Thanks for all your help @jayeshkrishna

rljacob commented 4 months ago

You can put "if" statements in the shell_commands file and only take action if it's a certain resolution. See this example from the "noio" testmod in eam:

(base) jacob@Roberts-MacAirM2 noio % more shell_commands
#!/bin/bash
./xmlchange --append CAM_CONFIG_OPTS='-cosp'

# save benchmark timing info for provenance
./xmlchange SAVE_TIMING=TRUE

# on KNLs, run hyper-threaded with 64x2
if [ `./xmlquery --value MACH` == theta ]||[ `./xmlquery --value MACH` == cori-knl ]; then
  ./xmlchange MAX_MPITASKS_PER_NODE=64
  ./xmlchange MAX_TASKS_PER_NODE=128
  ./xmlchange NTHRDS=2
  # avoid over-decomposing LND beyond 7688 clumps (grid cells)
  if [ `./xmlquery --value NTASKS_LND` -gt 3844 ]; then ./xmlchange NTHRDS_LND=1; fi
else
  ./xmlchange NTHRDS=1
fi

glemieux commented 4 months ago

Thanks for the suggestion @rljacob. I forgot I could xmlquery LND_GRID.
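Combining the two suggestions, the fates_cold shell_commands could guard the PIO change on the land grid. A rough sketch, assuming LND_GRID reports 1.9x2.5 for f19_g16 (worth verifying with ./xmlquery --value LND_GRID):

#!/bin/bash
# lower the I/O task count only for the f19 (1.9x2.5) land grid,
# where restart write requests can exceed INT_MAX bytes
if [ `./xmlquery --value LND_GRID` == 1.9x2.5 ]; then
  ./xmlchange PIO_NUMTASKS=4
  ./xmlchange PIO_STRIDE=-99
fi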

glemieux commented 4 months ago

I should note for reference that this test was working as of 67abd00; it stopped working sometime between then and 069c226.