EarthWorksOrg / EarthWorks

MPAS-A restart issue with NVHPC compiler both CPU and GPU #21

Open supreethms1809 opened 7 months ago

supreethms1809 commented 7 months ago

Issue description: EarthWorks stops abruptly, without any error message, when we do a restart run with MPAS-A as the dynamical core. We have narrowed the issue down to the call to cam_mpas_update_halo in cam_mpas_subdriver.F90 and, inside cam_mpas_update_halo, to the mpas_pool_get_field_info call. More details to come. We see this issue with all EarthWorks compsets (FHS94, F2000, QPC6, and fully coupled). Compiler: nvhpc/23.5

gdicker1 commented 6 months ago

Steps to produce:

  1. Run create_newcase and request the NVHPC compiler when creating the case
  2. Go into the case and edit some configurations (e.g. STOP_N, DOUT_S, etc)
  3. Edit options to enable a "restart run" - ./xmlchange REST_OPTION=$STOP_OPT,REST_N=$STOP_N,RESUBMIT=1
  4. Run ./case.setup
  5. Run ./case.build
  6. Run ./case.submit
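
The steps above can be sketched as a small shell helper. This is only a sketch under stated assumptions, not a script from the EarthWorks repository: the names run and reproduce_restart_failure are hypothetical, create_newcase is assumed to be on PATH, the grid alias is left as a required argument because this thread does not state one, and the STOP_OPTION/STOP_N values (ndays, 1) are placeholders. By default it only prints the commands; set DRY_RUN=0 to actually execute them.

```shell
# Hypothetical helper: print each command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Hypothetical wrapper around steps 1-6 of the reproduction recipe.
reproduce_restart_failure() {
    case_dir=$1            # new case directory (your choice)
    compset=$2             # e.g. FHS94, one of the compsets named in this issue
    res=$3                 # grid alias; not given in this issue, supply your own
    stop_opt=${4:-ndays}   # assumed placeholder value for STOP_OPTION
    stop_n=${5:-1}         # assumed placeholder value for STOP_N

    # Step 1: create the case and request the NVHPC compiler
    run create_newcase --case "$case_dir" --compset "$compset" \
        --res "$res" --compiler nvhpc || return 1
    # Step 2: enter the case and set the run length
    run cd "$case_dir" || return 1
    run ./xmlchange "STOP_OPTION=$stop_opt,STOP_N=$stop_n" || return 1
    # Step 3: write a restart at the end of the run and resubmit once
    run ./xmlchange "REST_OPTION=$stop_opt,REST_N=$stop_n,RESUBMIT=1" || return 1
    # Steps 4-6: set up, build, submit
    run ./case.setup || return 1
    run ./case.build || return 1
    run ./case.submit
}
```

For example, reproduce_restart_failure "$HOME/cases/fhs94_nvhpc" FHS94 "$GRID" prints the seven commands for review, where $GRID stands for whatever grid alias you normally use.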

The error occurs during the runs started with ./case.submit. The first run succeeds as expected (e.g. it changes the run/*.log* files so they end with .gz), but the restart run fails after writing some output to the atm.log.* and cesm.log.* files (I didn't find any content in the other files). The end of cesm.log.* contains a message about a rank dying from a signal. I suspect the failure happens during part of the atmosphere initialization; the next lines I would expect in atm.log.* are about variables U, V, Q, and T being set.

Error at end of cesm.log.*:

dec1014.hsn.de.hpc.ucar.edu: rank 4 died from signal 11 

Note: this was from a test of FHS94 on Derecho with NVHPC and Intel-OneAPI compilers. The Intel-OneAPI build finished both runs. Note: these runs were without GPU flags (i.e. they were CPU-only runs)
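
To check a run directory for this kind of failure quickly, something like the following could be used. It is a minimal sketch: find_rank_failures is a hypothetical name, the default directory name run is an assumption, and the two message patterns are taken from the log excerpts quoted in this thread. It skips *.gz logs on the assumption that those belong to runs that finished.

```shell
# Hypothetical helper: list rank-failure lines in the uncompressed
# cesm.log.* files of a run directory (default: ./run).
find_rank_failures() {
    run_dir=${1:-run}
    for f in "$run_dir"/cesm.log.*; do
        case $f in
            *.gz) continue ;;   # assumed to be logs of a run that finished
        esac
        [ -f "$f" ] || continue
        # The extra /dev/null operand makes grep prefix each match with the file name.
        grep -E 'died from signal [0-9]+|exited with code [0-9]+' "$f" /dev/null
    done
    return 0
}
```

On Linux, signal 11 is SIGSEGV (a segmentation fault), and an exit code of 139 is the shell's way of reporting the same thing (128 + 11).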

gdicker1 commented 6 months ago

@supreethms1809 I think given my CPU-only tests in the comment above, we should change this title to be NVHPC specific. I don't think GPU usage is involved here.

gdicker1 commented 2 months ago

This is confirmed to still be an issue today. I just ran an F2000climoEW test with the NVHPC v24.3 compilers and the restart run failed.

The last output in the atm.log.* file is:

   i MPAS constituent mpas_from_cam_cnst(i)       i CAM constituent  cam_from_mpas_cnst(i)
 ------------------------------------------     ------------------------------------------
   1              qv*                  1          1                Q                  1
# Skipping the other lines from the table
  33            SOAE                  33         33             SOAE                 33
  34            SOAG                  34         34             SOAG                 34
 ------------------------------------------     ------------------------------------------
 * = constituent used as a moisture species in MPAS-A dycore

The next lines I would have expected are:


 vertical coordinate dycore   : Height (z) vertical coordinate
 min/max of meshScalingDel2 = 1.00000000000000 1.00000000000000
 min/max of meshScalingDel4 = 1.00000000000000 1.00000000000000

The last output in the cesm.log* file is:

dec2284.hsn.de.hpc.ucar.edu 11: /var/run/palsd/bc550ddd-1ddc-4351-bdf0-6c58e7d59bb0/files/cpu_bind: line 77: 36589 Segmentation fault      numactl -C         "${ranges[lrank]}" $*
dec2284.hsn.de.hpc.ucar.edu: rank 11 exited with code 139
dec2284.hsn.de.hpc.ucar.edu: rank 0 died from signal 15

sherimickelson commented 1 month ago

I was able to successfully complete a restart run using the NVHPC compiler and the MPAS dynamical core.

It looks like the problem was here: call cam_mpas_update_halo('latCell', endrun), in subroutine cam_mpas_read_restart(restart_stream, endrun) in cam/src/dynamics/mpas/driver/cam_mpas_subdriver.F90.

When I remove the "endrun" argument the code is able to get past this point and complete the restart run.

The problem is occurring because endrun originally comes in via use cam_abortutils, only: endrun, but is then declared as procedure(halt_model) :: endrun in subroutine cam_mpas_read_restart(restart_stream, endrun). That subroutine in turn passes it to the subroutine where the failure occurs, subroutine cam_mpas_update_halo(fieldName, endrun).

This pattern occurs all over this file, but as far as I can see, endrun is only executed if an error is encountered, except in this function, where it is passed along as an argument. This is where the run appears to fail, with a memory overwrite of 'latCell'. I'm not sure why this is OK with other compilers but not with NVIDIA's.

I don't know if "removing endrun as an argument" is the correct fix, but it gives us a place to start talking about how we want to fix it.
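
As a starting point for that discussion, the call sites that forward endrun as an actual argument could be listed with a grep along these lines. This is a rough sketch: list_endrun_forwarding is a hypothetical name, and the pattern only catches single-line call statements, so calls continued with & across lines would be missed.

```shell
# Hypothetical helper: print the numbered lines of a Fortran source file where
# `endrun` is passed as an actual argument in a single-line `call` statement.
# Lines such as `call endrun('message')` are not matched.
list_endrun_forwarding() {
    src=$1
    grep -niE 'call[[:space:]]+[a-z0-9_]+[[:space:]]*\((.*,)?[[:space:]]*endrun[[:space:]]*[,)]' "$src"
}
```

Usage, from the top of the source tree: list_endrun_forwarding cam/src/dynamics/mpas/driver/cam_mpas_subdriver.F90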

sherimickelson commented 1 month ago

Reproducer found here https://github.com/sherimickelson/cam_mpas_restart_reproducer

gdicker1 commented 4 weeks ago

Based on info from @cponder, a fix for this issue should come with NVHPC 24.9 next month.