supreethms1809 closed this issue 1 month ago.
Steps to reproduce (request the NVHPC compiler when creating the case):

```
create_newcase
./xmlchange REST_OPTION=$STOP_OPT,REST_N=$STOP_N,RESUBMIT=1
./case.setup
./case.build
./case.submit
```
The error occurs during the runs started with `./case.submit`: the first run succeeds as expected (e.g. the `run/*.log*` files are renamed so they end with `.gz`), but the restart run fails after writing some output to the `atm.log.*` and `cesm.log.*` files (I didn't find any content in the other files). The end of `cesm.log.*` contains a message about a rank dying from a signal. I suspect this is failing during part of the initialization of the atmosphere; the next lines I would expect in `atm.log.*` report the variables `U`, `V`, `Q`, and `T` being set.
Error at the end of `cesm.log.*`:

```
dec1014.hsn.de.hpc.ucar.edu: rank 4 died from signal 11
```
Note: this was from a test of FHS94 on Derecho with the NVHPC and Intel-OneAPI compilers. The Intel-OneAPI build finished both runs. These runs were without GPU flags (i.e. they were CPU-only runs).
@supreethms1809 Given my CPU-only tests in the comment above, I think we should change this title to be NVHPC-specific. I don't think GPU usage is involved here.
This is confirmed to still be an issue today. I just ran an F2000climoEW test with the NVHPC v24.3 compilers and the restart run failed.
The last output in the `atm.log.*` file is:

```
 i MPAS constituent mpas_from_cam_cnst(i)    i CAM constituent cam_from_mpas_cnst(i)
------------------------------------------  ------------------------------------------
 1 qv*              1                        1 Q               1
# Skipping the other lines from the table
33 SOAE             33                      33 SOAE            33
34 SOAG             34                      34 SOAG            34
------------------------------------------  ------------------------------------------
* = constituent used as a moisture species in MPAS-A dycore
```
The next lines I would have expected are:

```
vertical coordinate dycore : Height (z) vertical coordinate
min/max of meshScalingDel2 = 1.00000000000000  1.00000000000000
min/max of meshScalingDel4 = 1.00000000000000  1.00000000000000
```
The last output in the `cesm.log.*` file is:

```
dec2284.hsn.de.hpc.ucar.edu 11: /var/run/palsd/bc550ddd-1ddc-4351-bdf0-6c58e7d59bb0/files/cpu_bind: line 77: 36589 Segmentation fault numactl -C "${ranges[lrank]}" $*
dec2284.hsn.de.hpc.ucar.edu: rank 11 exited with code 139
dec2284.hsn.de.hpc.ucar.edu: rank 0 died from signal 15
```
I was able to successfully run a restart run using the nvhpc compiler and mpas dynamical core.
It looks like the problem was the call `call cam_mpas_update_halo('latCell', endrun)` in `subroutine cam_mpas_read_restart(restart_stream, endrun)` in `cam/src/dynamics/mpas/driver/cam_mpas_subdriver.F90`. When I remove the `endrun` argument, the code is able to get past this point and complete the restart run.

The problem occurs because `endrun` is initially imported via `use cam_abortutils, only: endrun`, but is then declared as a procedure dummy argument, `procedure(halt_model) :: endrun`, in `subroutine cam_mpas_read_restart(restart_stream, endrun)`, which in turn calls the subroutine where the failure happens, `subroutine cam_mpas_update_halo(fieldName, endrun)`.

This pattern occurs all over the file, but as far as I can see `endrun` is only executed if an error is encountered, except in this routine, where it is merely passed through. That pass-through is where it looks to be failing, with a memory overwrite of `'latCell'`. I'm not sure why other compilers are OK with this but NVIDIA's is not.

I don't know whether removing `endrun` as an argument is the correct fix, but it gives us a place to start talking about how we want to fix it.
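For context, the pass-through pattern described above can be sketched in a minimal, self-contained Fortran program. All names below (`halt_model`, `my_endrun`, `read_restart`, `update_halo`) are illustrative stand-ins for the CAM routines, not the actual CESM code; the point is just a module procedure passed as a `procedure(...)` dummy argument and then forwarded through a second routine:

```fortran
module halt_mod
  implicit none
  ! Mirrors the role of halt_model in cam_abortutils: an abstract
  ! interface that an abort routine must match.
  abstract interface
    subroutine halt_model(msg)
      character(len=*), intent(in) :: msg
    end subroutine halt_model
  end interface
contains
  ! Stand-in for endrun: only ever invoked on error.
  subroutine my_endrun(msg)
    character(len=*), intent(in) :: msg
    print *, 'ABORT: ', trim(msg)
    stop 1
  end subroutine my_endrun
end module halt_mod

module driver_mod
  use halt_mod, only: halt_model
  implicit none
contains
  ! Stand-in for cam_mpas_read_restart: receives the abort routine as
  ! a procedure dummy and forwards it to a deeper routine. This nested
  ! pass-through is the pattern that appeared to trip up NVHPC.
  subroutine read_restart(endrun)
    procedure(halt_model) :: endrun
    call update_halo('latCell', endrun)
  end subroutine read_restart

  ! Stand-in for cam_mpas_update_halo: endrun is never called on the
  ! success path, only passed in case an error is encountered.
  subroutine update_halo(fieldName, endrun)
    character(len=*), intent(in) :: fieldName
    procedure(halt_model) :: endrun
    print *, 'updating halo for ', fieldName
  end subroutine update_halo
end module driver_mod

program reproducer_sketch
  use halt_mod,   only: my_endrun
  use driver_mod, only: read_restart
  implicit none
  call read_restart(my_endrun)
end program reproducer_sketch
```

This is a sketch, not the reproducer itself; compiling and running it with a known-bad NVHPC version would be one way to check whether the procedure-dummy pass-through alone triggers the failure, or whether the MPAS pool machinery is also needed.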
A reproducer can be found here: https://github.com/sherimickelson/cam_mpas_restart_reproducer
Based on info from @cponder, a fix for this issue should come with NVHPC 24.9 next month.
NVHPC 24.9 fixes this issue.
Once merged, PR #77 will use nvhpc/24.9 by default.
Issue description: the Earthworks code abruptly stops (without any error message) when we do a restart with MPAS-A as the dynamical core. We were able to narrow the issue down to a subroutine call, `cam_mpas_update_halo` in `cam_mpas_subdriver.F90`, and further to the `mpas_pool_get_field_info` call inside `cam_mpas_update_halo`. More details to come. We are facing this issue with all Earthworks compsets (FHS94, F2000, QPC6, and fully coupled). Compiler: nvhpc/23.5.