NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0
74 stars 165 forks

C768 analysis tasks Fail on Hera #2498

Open spanNOAA opened 5 months ago

spanNOAA commented 5 months ago

What is wrong?

The gdassfcanl, gfssfcanl, and gdasanalcalc tasks fail starting from the second cycle. Regardless of the wall time set for the job, the tasks consistently exceed the time limit.

I am attempting to run the simulations starting from 2023021018 and ending 2023022618.

Brief snippet of error from the gdassfcanl.log and gfssfcanl.log files for the 2023021100 forecast cycle:

0: update OUTPUT SFC DATA TO: ./fnbgso.001
0:
0: CYCLE PROGRAM COMPLETED NORMALLY ON RANK: 0
0:
slurmstepd: error: STEP 58349057.0 ON h34m13 CANCELLED AT 2024-04-16T21:54:15 DUE TO TIME LIMIT
slurmstepd: error: JOB 58349057 ON h34m13 CANCELLED AT 2024-04-16T21:54:15 DUE TO TIME LIMIT


Start Epilog on node h34m13 for job 58349057 :: Tue Apr 16 21:54:17 UTC 2024
Job 58349057 finished for user Sijie.Pan in partition hera with exit code 0:0


End Epilogue Tue Apr 16 21:54:17 UTC 2024

Brief snippet of error from the gdasanalcalc.log file for the 2023021100 forecast cycle:

What should have happened?

The 'gdassfcanl', 'gfssfcanl', and 'gdasanalcalc' tasks generate the files required by the remainder of the workflow.

What machines are impacted?

Hera

Steps to reproduce

  1. Set up experiment and generate xml file. ./setup_expt.py gfs cycled --app ATM --pslot C768_6hourly_0210 --nens 80 --idate 2023021018 --edate 2023022618 --start cold --gfs_cyc 4 --resdetatmos 768 --resensatmos 384 --configdir /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs/parm/config/gfs --comroot ${COMROOT} --expdir ${EXPDIR} --icsdir /scratch2/BMC/wrfruc/Guoqing.Ge/ufs-ar/ICS/2023021018C768C384L128/output
  2. Change the wall time for the gdassfcanl, gfssfcanl, and gdasanalcalc tasks.
  3. Use rocoto to start the workflow (a sketch of a typical invocation is shown after this list).
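A minimal sketch of step 3, assuming the usual $PSLOT naming for the rocoto database and XML files (adjust to your experiment directory):

user@host:$ cd $EXPDIR/C768_6hourly_0210
user@host:$ rocotorun -d C768_6hourly_0210.db -w C768_6hourly_0210.xml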

Additional information

You can find gdassfcanl.log, gfssfcanl.log and gdasanalcalc.log in the following directory: /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/comroot/C768_6hourly_0210/logs/2023021100
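If helpful, a quick way to pull the timeout messages out of those logs (based on the slurmstepd errors quoted above; adjust the pattern as needed):

user@host:$ cd /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/comroot/C768_6hourly_0210/logs/2023021100
user@host:$ grep -iE "CANCELLED AT|DUE TO TIME LIMIT" gdassfcanl.log gfssfcanl.log gdasanalcalc.log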

Do you have a proposed solution?

No response

HenryRWinterbottom commented 5 months ago

@spanNOAA Have you compiled UFS with the unstructured wave grids (option -w)?

spanNOAA commented 5 months ago

@spanNOAA Have you compiled UFS with the unstructured wave grids (option -w)?

No, I compiled the global workflow only using the '-g' option.

HenryRWinterbottom commented 5 months ago

@spanNOAA Are you using the top of the develop branch for the g-w?

spanNOAA commented 5 months ago

@spanNOAA Are you using the top of the develop branch for the g-w?

Yes, I'm using the develop branch.

spanNOAA commented 5 months ago

FYI, this problem was only observed with C768. I have no issue with C384.

HenryRWinterbottom commented 5 months ago

@spanNOAA Can you please point me to your g-w develop branch path on RDHPCS Hera?

spanNOAA commented 5 months ago

@spanNOAA Can you please point me to your g-w develop branch path on RDHPCS Hera?

I can locate the local repo at: /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs.

HenryRWinterbottom commented 5 months ago

@spanNOAA Thank you.

Can you please check out and/or update your current develop branch and recompile the UFS model? You can do so as follows.

user@host:$ cd sorc
user@host:$ ./build_ufs.sh -w

That will ensure that both the executable is up-to-date and can use the unstructured wave grids. Can you then rerun your C768 experiment to see if the same exceptions are raised?
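If the clone needs refreshing first, a minimal sketch (the path is a placeholder; how the components under sorc/ are synced depends on your g-w version):

user@host:$ cd /path/to/global-workflow     # placeholder for your local clone
user@host:$ git checkout develop
user@host:$ git pull origin develop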

spanNOAA commented 5 months ago

@spanNOAA Thank you.

Can you please check out and/or update your current develop branch and recompile the UFS model? You can do so as follows.

user@host:$ cd sorc
user@host:$ ./build_ufs.sh -w

That will ensure that both the executable is up-to-date and can use the unstructured wave grids. Can you then rerun your C768 experiment to see if the same exceptions are raised?

Certainly. But before doing so, may I ask two questions:

  1. Will utilizing the -w option have any impact on the analysis or forecast outcomes?
  2. Has the latest develop branch been updated with the kchunk3d bug fixes for the ufs model that were merged yesterday?

HenryRWinterbottom commented 5 months ago
  1. Yes, I would assume there to be some differences between the use of a structured versus unstructured wave grid;
  2. Can you send me the tag for that branch? It likely has not been updated. However, you can clone the branch you are referencing into sorc/ufs_model.fd and then build (see the sketch below). Make sure to run sorc/link_workflow.sh once you make the updates.
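A hedged sketch of that sequence (the branch name is a placeholder for whichever UFSWM branch carries the fix you need; link_workflow.sh options, if any, depend on your g-w version):

user@host:$ cd sorc
user@host:$ rm -rf ufs_model.fd
user@host:$ git clone --branch <your-ufswm-branch> https://github.com/ufs-community/ufs-weather-model ufs_model.fd
user@host:$ ./build_ufs.sh -w
user@host:$ ./link_workflow.sh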

WalterKolczynski-NOAA commented 5 months ago

These are analysis jobs and have nothing to do with -w, don't worry about it.

C768 is not a resolution we test regularly, and we tend to discourage people from running C768 on Hera anyway because the machine is small.

How much larger did you try making the wallclock? Have you tried increasing the number of cores instead/as well?

spanNOAA commented 5 months ago

When you mention checking out and/or updating my current develop branch, are you indicating that the entire global workflow needs updating, or is it solely the ufs model that requires updating? The version I mentioned is 281b32f.

HenryRWinterbottom commented 5 months ago

@spanNOAA Thank you for the tag. We are currently testing hash 281b32fb but encountering errors when executing the forecast (e.g., ufs_model.exe) for C768 resolutions. As a result, the referenced UFSWM tag will not work at the moment. Please see issue #2490.

WalterKolczynski-NOAA commented 5 months ago

Additionally, when you tried increasing the wallclock, did you regenerate your rocoto XML afterwards?
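For reference, a hedged sketch of the usual way to change a wallclock and have rocoto pick it up (script and file names assume a recent develop layout; placeholders marked as such):

user@host:$ vi $EXPDIR/C768_6hourly_0210/config.resources    # edit the wallclock for the failing tasks
user@host:$ cd /path/to/global-workflow/workflow             # placeholder clone path
user@host:$ ./setup_xml.py $EXPDIR/C768_6hourly_0210         # regenerate the rocoto XML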

WalterKolczynski-NOAA commented 5 months ago

@spanNOAA Thank you for the tag. We are currently testing hash 281b32fb but encountering errors when executing the forecast (e.g., ufs_model.exe) for C768 resolutions. As a result, the referenced UFSWM tag will not work at the moment. Please see issue #2490.

These failures are in the analysis job. It is unlikely anything with UFS or its build is the problem here.

spanNOAA commented 5 months ago

These are analysis jobs and have nothing to do with -w, don't worry about it.

C768 is not a resolution we test regularly, and we tend to discourage people from running C768 on Hera anyway because the machine is small.

How much larger did you try making the wallclock? Have you tried increasing the number of cores instead/as well?

I attempted wallclock settings ranging from 10 to 40 minutes, but none of them worked. When the wallclock was set to 20 minutes or more, the program consistently stalled at the same point. I haven't tried increasing the number of cores. According to the log, the program seems to terminate normally; however, the slurm job continues afterward for an unknown reason. I manually extended the wallclock duration by directly editing the XML file instead of using the config files.

WalterKolczynski-NOAA commented 5 months ago

Okay, I'm going to check your full log and see if I can find anything; otherwise we might need to get a specialist to look at it.

spanNOAA commented 5 months ago

I really appreciate it.

WalterKolczynski-NOAA commented 5 months ago

looking at sfcanl, the problem seems to be in global_cycle. Ranks 0-2 finish but 3-5 never do:

> egrep '^3:' gdassfcanl.log.1 
3:  
3:  STARTING CYCLE PROGRAM ON RANK            3
3:  RUNNING WITH            6 TASKS
3:  AND WITH            1  THREADS.
3:  
3:  READ NAMCYC NAMELIST.
3:  
3:  
3:  IN ROUTINE SFCDRV,IDIM=         768 JDIM=         768 FH=
3:   0.000000000000000E+000
3:  - RUNNING WITH FRACTIONAL GRID.
3:  
3:  READ FV3 GRID INFO FROM: ./fngrid.004
3:  
3:  READ FV3 OROG INFO FROM: ./fnorog.004
3:  
3:  WILL PROCESS NSST RECORDS.
3:  
3:  READ INPUT SFC DATA FROM: ./fnbgsi.004
3:  - WILL PROCESS FOR NOAH-MP LSM.
3:  
3:  WILL READ NSST RECORDS.
3:  
3:  USE UNFILTERED OROGRAPHY.
3:  
3:  SAVE FIRST GUESS MASK
3:  
3:  CALL SFCCYCLE TO UPDATE SURFACE FIELDS.

Since the ranks are tiles, they should all have similar run times. I think this points back to a memory issue. Try changing the resource request to:

<nodes>6:ppn=40:tpp=1</nodes>

That should be overkill, but if it works we can try dialing it back.
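As a quick check on which ranks finished, a small sketch using the rank prefix visible in the log excerpt above (adjust the log file name/path as needed):

for rank in 0 1 2 3 4 5; do
  if grep -q "^${rank}: .*CYCLE PROGRAM COMPLETED NORMALLY" gdassfcanl.log.1; then
    echo "rank ${rank}: completed"
  else
    echo "rank ${rank}: never finished (likely hung)"
  fi
done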

spanNOAA commented 5 months ago

The problem remains despite increasing the nodes to 6.

WalterKolczynski-NOAA commented 5 months ago

@GeorgeGayno-NOAA I'm out of simple ideas, can you take a look at this issue?

GeorgeGayno-NOAA commented 5 months ago

@GeorgeGayno-NOAA I'm out of simple ideas, can you take a look at this issue?

Sure. The global_cycle program should run in under 5 minutes at C768. And is not memory intensive since it works on 2D surface fields. Here is how I run a C768 regression test on Hera.

export OMP_NUM_THREADS_CY=2
TEST1=$(sbatch --parsable --ntasks-per-node=6 --nodes=1 -t 0:05:00 -A $PROJECT_CODE -q $QUEUE -J c768.fv3gfs \
      -o $LOG_FILE -e $LOG_FILE ./C768.fv3gfs.sh)

@spanNOAA - Is it always the same tiles/mpi tasks that hang or is it random? Are you using a recent version of 'develop', which works on Rocky 8?

spanNOAA commented 5 months ago

@GeorgeGayno-NOAA I'm out of simple ideas, can you take a look at this issue?

Sure. The global_cycle program should run in under 5 minutes at C768. And is not memory intensive since it works on 2D surface fields. Here is how I run a C768 regression test on Hera.

export OMP_NUM_THREADS_CY=2
TEST1=$(sbatch --parsable --ntasks-per-node=6 --nodes=1 -t 0:05:00 -A $PROJECT_CODE -q $QUEUE -J c768.fv3gfs \
      -o $LOG_FILE -e $LOG_FILE ./C768.fv3gfs.sh)

@spanNOAA - Is it always the same tiles/mpi tasks that hang or is it random? Are you using a recent version of 'develop', which works on Rocky 8?

It's not random. Every time, the tasks for tiles 3-5 stall. While I'm not using the latest version of the 'develop' branch, it does support Rocky 8. The hash of the global workflow I'm using is d6be3b5c.

GeorgeGayno-NOAA commented 4 months ago

@GeorgeGayno-NOAA I'm out of simple ideas, can you take a look at this issue?

Sure. The global_cycle program should run in under 5 minutes at C768. And is not memory intensive since it works on 2D surface fields. Here is how I run a C768 regression test on Hera.

export OMP_NUM_THREADS_CY=2
TEST1=$(sbatch --parsable --ntasks-per-node=6 --nodes=1 -t 0:05:00 -A $PROJECT_CODE -q $QUEUE -J c768.fv3gfs \
      -o $LOG_FILE -e $LOG_FILE ./C768.fv3gfs.sh)

@spanNOAA - Is it always the same tiles/mpi tasks that hang or is it random? Are you using a recent version of 'develop', which works on Rocky 8?

It's not random. Every time, the tasks for tiles 3-5 stall. While I'm not using the latest version of the 'develop' branch, it does support Rocky 8. The hash of the global workflow I'm using is d6be3b5.

Let me try to run the cycle step myself. Don't delete your working directories.

GeorgeGayno-NOAA commented 4 months ago

I was able to run your test case using my own stand-alone script - /scratch1/NCEPDEV/da/George.Gayno/cycle.broke

If I just run tile 1, there is a bottleneck in the interpolation of the GLDAS soil moisture to the tile:

  in fixrdc for mon=           1  fngrib=
 /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs/fix/am/global_soilmgldas.statsgo.t1534.3072.1536.grb

The interpolation for month=1 takes about 6 minutes 30 seconds. And there are many uninterpolated points:

unable to interpolate. filled with nearest point value at 359656 points

The UFS_UTILS C768 regression test, which uses a non-fractional grid, runs very quickly. And there are very few uninterpolated points:

unable to interpolate. filled with nearest point value at 309 points

The C48 regression test uses a fractional grid. It runs quickly, but there is a very high percentage of uninterpolated points:

  in fixrdc for mon=           3  fngrib=
 /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/UFS_UTILS/reg_tests/global_cycle/../../fix/am/global_soilmgldas.statsgo.t94.192.96.grb

  unable to interpolate.  filled with nearest point value at         1308  points

Maybe there is a problem with how the interpolation mask is being set up for fractional grids?

spanNOAA commented 4 months ago

Could you provide guidance on setting up the interpolation mask correctly for fractional grids? Also, as we're going to run 3-week analysis-forecast cycles, I'm curious about the potential impact of using non-fractional grids instead of fractional grids.

GeorgeGayno-NOAA commented 4 months ago

Could you provide guidance on setting up the interpolation mask correctly for fractional grids? Also, as we're going to run 3-week analysis-forecast cycles, I'm curious about the potential impact of using non-fractional grids instead of fractional grids.

I think the mask problem is a bug in the global_cycle code. I will need to run some tests.

GeorgeGayno-NOAA commented 4 months ago

@spanNOAA - I found the problem and have a fix. What hash of the ccpp-physics are you using?

spanNOAA commented 4 months ago

I checked the CMakeLists.txt file located at sorc/ufs_utils.fd/ccpp-physics, and it shows that the ccpp version is 5.0.0.
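A hedged way to report the exact commit rather than the CMake version string (assuming the component is a git checkout):

user@host:$ cd sorc/ufs_utils.fd/ccpp-physics
user@host:$ git log -1 --oneline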

guoqing-noaa commented 4 months ago

For sorc/ufs_utils.fd/ccpp-physics: the hash is:

commit 3a306a493a9a0b6c3c39c7b50d356f0ddb7c5c94 (HEAD)
Merge: eda81a58 17c73687
Author: Grant Firl <grant.firl@noaa.gov>
Date:   Tue May 9 13:14:47 2023 -0400
    Merge pull request #65 from Qingfu-Liu/update_HR2  
    PBL and Convection and Microphysics update for HR2

for sorc/ufs_model.fd/FV3/ccpp/physics the hash is:

commit 9b0ac7b16a45afe5e7f1abf9571d3484158a5b43 (HEAD, origin/ufs/dev, origin/HEAD, ufs/dev)
Merge: 98396808 7fa55935
Author: Grant Firl <grant.firl@noaa.gov>
Date:   Wed Mar 27 11:26:20 2024 -0400
    Merge pull request #184 from lisa-bengtsson/cloudPR
    Introduce namelist flag xr_cnvcld to control if suspended grid-mean convective cloud condensate should be included in cloud fraction and optical depth calculation in the GFS suite

GeorgeGayno-NOAA commented 4 months ago

I checked the CMakeList file located at sorc/ufs_utils.fd/ccpp-physics, and it shows that the ccpp version is 5.0.0.

I have a fix. Replace the version of sfcsub.F in /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs/sorc/ufs_utils.fd/ccpp-physics/physics with the version here: /scratch1/NCEPDEV/da/George.Gayno/cycle.broke Then, recompile ufs_utils.

It should run now with only six mpi tasks - one task per tile.
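A minimal sketch of applying that fix (it assumes sfcsub.F sits at the top of the cycle.broke directory and that ufs_utils is rebuilt with the usual sorc/build_ufs_utils.sh script; adjust to your local build setup):

user@host:$ cp /scratch1/NCEPDEV/da/George.Gayno/cycle.broke/sfcsub.F \
      /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs/sorc/ufs_utils.fd/ccpp-physics/physics/sfcsub.F
user@host:$ cd /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs/sorc
user@host:$ ./build_ufs_utils.sh
user@host:$ ./link_workflow.sh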

spanNOAA commented 4 months ago

I checked the CMakeList file located at sorc/ufs_utils.fd/ccpp-physics, and it shows that the ccpp version is 5.0.0.

I have a fix. Replace the version of sfcsub.F in /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs/sorc/ufs_utils.fd/ccpp-physics/physics with the version here: /scratch1/NCEPDEV/da/George.Gayno/cycle.broke Then, recompile ufs_utils.

It should run now with only six mpi tasks - one task per tile.

The fix successfully resolves the issues for both gdassfcanl and gfssfcanl; both tasks now complete without any problems. Another issue concerns the gdasanalcalc task, which also gets stuck at a particular point until it exceeds the wall clock. Could you please investigate this problem as well?

RussTreadon-NOAA commented 4 months ago

I examined the C768 gdasanalcalc failure on Hera, with the following findings.

Job gdasanalcalc copies interp_inc.x to chgres_inc.x. g-w ush/calcanl_gfs.py executes chgres_inc.x. As reported in this issue, chgres_inc.x hangs on Hera when processing C768 files.

I was able to reproduce this behavior in a stand-alone shell script that executes interp_inc.x. Script test_gw.sh in /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi-utils uses the same job configuration as gdasanalcalc. interp_inc.x hangs and the job runs until the specified wall time is reached.

Script test.sh (same directory) alters the job configuration and interp_inc.x runs to completion.

test_gw.sh specifies

#SBATCH --nodes=4
#SBATCH --tasks-per-node=40

whereas test.sh specifies

#SBATCH --nodes=1
#SBATCH --tasks-per-node=10

Both scripts execute interp_inc.x as srun -l -n 10 --verbose --export=ALL -c 1 $interpexe. The only difference is in the indicated SBATCH lines.
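For reference, a minimal job script along the lines of test.sh that ran interp_inc.x to completion (account, QOS, and executable path are placeholders):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=10
#SBATCH -t 00:30:00
#SBATCH -A <account>
#SBATCH -q <qos>

interpexe=/path/to/interp_inc.x   # placeholder
srun -l -n 10 --verbose --export=ALL -c 1 $interpexe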

calcanl_gfs.py executes chgres_inc.x. gdasanalcalc runs chgres_inc.x with

srun -n 10 --verbose --export=ALL -c 1 --distribution=arbitrary --cpu-bind=cores 

The parallel xml specifies

        <nodes>4:ppn=40:tpp=1</nodes>

for the gfs and gdas analcalc job.

The analcalc job runs several executables. interp_inc.x runs 10 tasks. calc_anal.x runs 127 tasks. gaussian_sfcanl.x runs 1 task. This is why the xml for analcalc specifies 4 nodes with 40 tasks per node.

I do not have a solution for the Hera hang in gdasanalcalc at C768. I am simply sharing what tests reveal.

spanNOAA commented 3 months ago

Hi @RussTreadon-NOAA, just following up on the Hera hang issue in gdasanalcalc at C768 that we discussed about a month ago. You mentioned that there wasn't a solution available at that time and shared some test results.

I wanted to check in to see if there have been any updates or progress on resolving this issue since then.

RussTreadon-NOAA commented 3 months ago

@spanNOAA , no updates from me. I am not actively working on this issue.

guoqing-noaa commented 3 months ago

@SamuelTrahanNOAA Could you take a look at this issue? Thanks!

DavidHuber-NOAA commented 1 month ago

I am looking into this. Presently, I am not able to cycle past the first half-cycle due to OOM errors, so that will need to be resolved first.

DavidHuber-NOAA commented 1 month ago

I do not have a solution for this yet, either, but I do have some additional details. The hang occurs at line 390 of driver.F90. The mpi_send is successful in that the corresponding mpi_recv at line 413 is able to pick up the data and continue processing, but it stops at line 422 waiting for the next mpi_send at line 392, which never comes. It is not clear to me why the mpi_send at line 390 does not return after sending the data.

DavidHuber-NOAA commented 1 month ago

The issue appears to be the size of the buffer that is sent via mpi_send and may reflect a bug in MPI, though I am not certain of that, based on this discussion. I have found a workaround and have a draft PR in place (https://github.com/NOAA-EMC/GSI-utils/pull/49). This needs to be tested for reproducibility. Is that something you could help with @guoqing-noaa?
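For anyone who wants to try the draft PR in a g-w checkout, a hedged sketch (the sorc/gsi_utils.fd directory and build-script name are assumptions for a typical develop layout):

user@host:$ cd sorc/gsi_utils.fd
user@host:$ git fetch origin pull/49/head:gsi-utils-pr49
user@host:$ git checkout gsi-utils-pr49
user@host:$ cd ..
user@host:$ ./build_gsi_utils.sh
user@host:$ ./link_workflow.sh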

guoqing-noaa commented 1 month ago

@DavidHuber-NOAA Thanks a lot for the help. We will test your PR#49 and update you on how it goes.

DavidHuber-NOAA commented 1 month ago

@guoqing-noaa I have opened PR #2819. The branch has other C768 fixes in it that will be helpful for testing. I had another problem with the analysis UPP job, so this is still a work in progress.

guoqing-noaa commented 1 month ago

Thanks, @DavidHuber-NOAA

spanNOAA commented 1 month ago

@DavidHuber-NOAA I have no issues with the C768 gdasanalcalc task after applying this fix.