spanNOAA opened this issue 5 months ago
@spanNOAA Have you compiled UFS with the unstructured wave grids (option -w)?
No, I compiled the global workflow using only the '-g' option.
@spanNOAA Are you using the top of the develop branch for the g-w?
Yes, I'm using the develop branch.
FYI, this problem was only observed with C768. I have no issue with C384.
@spanNOAA Can you please point me to your g-w develop branch path on RDHPCS Hera?
My local repo is at: /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs.
@spanNOAA Thank you.
Can you please check out and/or update your current develop branch and recompile the UFS model? You can do so as follows:
user@host:$ cd sorc
user@host:$ ./build_ufs.sh -w
That will ensure both that the executable is up to date and that it can use the unstructured wave grids. Can you then rerun your C768 experiment to see if the same exceptions are raised?
Certainly. But before doing so, may I ask two questions:
Just update sorc/ufs_model.fd and then build. Make sure to run sorc/link_workflow.sh once you make the updates. These are analysis jobs and have nothing to do with -w, so don't worry about it.
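(A minimal sketch of that update-and-rebuild sequence, assuming sorc/ufs_model.fd is a plain git checkout of the ufs-weather-model; the hash placeholder and comments are illustrative, not prescriptive:)
user@host:$ cd sorc/ufs_model.fd
user@host:$ git fetch origin && git checkout <desired-hash>   # whatever develop hash you want to test
user@host:$ cd ..
user@host:$ ./build_ufs.sh -w                                 # rebuild with unstructured wave grid support
user@host:$ ./link_workflow.sh                                # relink the updated executables into the workflow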
C768 is not a resolution we test regularly, and we tend to discourage people from running C768 on Hera anyway because the machine is small.
How much larger did you try making the wallclock? Have you tried increasing the number of cores instead/as well?
When you mention checking out and/or updating my current develop branch, are you indicating that the entire global workflow needs updating, or is it solely the ufs model that requires updating? The version I mentioned is 281b32f.
@spanNOAA Thank you for the tag. We are currently testing hash 281b32fb but are encountering errors when executing the forecast (e.g., ufs_model.exe) for C768 resolutions. As a result, the referenced UFSWM tag will not work at the moment. Please see issue #2490.
Additionally, when you tried increasing the wallclock, did you regenerate your rocoto XML afterwards?
These failures are in the analysis job. It is unlikely anything with UFS or its build is the problem here.
I attempted wallclock settings ranging from 10 to 40 minutes, but none of them worked. When the wallclock was set to 20 minutes or more, the program consistently stalled at the same point. I haven't tried increasing the number of cores. According to the log, the program seems to terminate normally; however, the slurm job continues afterward for an unknown reason. I manually extended the wallclock duration by directly editing the XML file instead of using the config files.
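(Side note: if the goal is to avoid hand-editing the XML, a rough sketch of the config-driven route, assuming a typical global-workflow experiment directory $EXPDIR holding config.resources and the workflow/setup_xml.py generator; the exact file and entry names in your checkout may differ:)
user@host:$ vi $EXPDIR/config.resources      # raise the walltime entry for the sfcanl job
user@host:$ cd workflow
user@host:$ ./setup_xml.py $EXPDIR           # regenerate the rocoto XML so the change takes effect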
Okay, I'm going to check your full log and see if I can find anything; otherwise I might need to get a specialist to look at it.
I really appreciate it.
Looking at sfcanl, the problem seems to be in global_cycle. Ranks 0-2 finish but 3-5 never do:
> egrep '^3:' gdassfcanl.log.1
3:
3: STARTING CYCLE PROGRAM ON RANK 3
3: RUNNING WITH 6 TASKS
3: AND WITH 1 THREADS.
3:
3: READ NAMCYC NAMELIST.
3:
3:
3: IN ROUTINE SFCDRV,IDIM= 768 JDIM= 768 FH=
3: 0.000000000000000E+000
3: - RUNNING WITH FRACTIONAL GRID.
3:
3: READ FV3 GRID INFO FROM: ./fngrid.004
3:
3: READ FV3 OROG INFO FROM: ./fnorog.004
3:
3: WILL PROCESS NSST RECORDS.
3:
3: READ INPUT SFC DATA FROM: ./fnbgsi.004
3: - WILL PROCESS FOR NOAH-MP LSM.
3:
3: WILL READ NSST RECORDS.
3:
3: USE UNFILTERED OROGRAPHY.
3:
3: SAVE FIRST GUESS MASK
3:
3: CALL SFCCYCLE TO UPDATE SURFACE FIELDS.
Since the ranks are tiles, they should all have similar run times. I think this points back to a memory issue. Try changing the resource request to:
<nodes>6:ppn=40:tpp=1</nodes>
That should be overkill, but if it works we can try dialing it back.
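(If the resource change is made by editing the generated XML directly, the stalled task then typically needs to be rewound and rebooted with rocoto; a hedged sketch with illustrative paths, cycle, and task name:)
user@host:$ rocotorewind -w /path/to/your.xml -d /path/to/your.db -c 202302110000 -t gdassfcanl
user@host:$ rocotoboot -w /path/to/your.xml -d /path/to/your.db -c 202302110000 -t gdassfcanl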
The problem remains despite increasing the nodes to 6.
@GeorgeGayno-NOAA I'm out of simple ideas, can you take a look at this issue?
Sure. The global_cycle program should run in under 5 minutes at C768, and it is not memory intensive since it works on 2D surface fields. Here is how I run a C768 regression test on Hera:
export OMP_NUM_THREADS_CY=2
TEST1=$(sbatch --parsable --ntasks-per-node=6 --nodes=1 -t 0:05:00 -A $PROJECT_CODE -q $QUEUE -J c768.fv3gfs \
-o $LOG_FILE -e $LOG_FILE ./C768.fv3gfs.sh)
@spanNOAA - Is it always the same tiles/mpi tasks that hang or is it random? Are you using a recent version of 'develop', which works on Rocky 8?
It's not random. Every time, the tasks for tiles 3-5 stall. While I'm not using the latest version of the 'develop' branch, it does support Rocky 8. The hash of the global workflow I'm using is d6be3b5c.
Let me try to run the cycle step myself. Don't delete your working directories.
I was able to run your test case using my own stand-alone script - /scratch1/NCEPDEV/da/George.Gayno/cycle.broke
If I just run tile 1, there is a bottleneck in the interpolation of the GLDAS soil moisture to the tile:
in fixrdc for mon= 1 fngrib= /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs/fix/am/global_soilmgldas.statsgo.t1534.3072.1536.grb
The interpolation for month=1 takes 6:30 minutes. And there are many uninterpolated points:
unable to interpolate. filled with nearest point value at 359656 points
The UFS_UTILS C768 regression test, which uses a non-fractional grid, runs very quickly. And there are very few uninterpolated points:
unable to interpolate. filled with nearest point value at 309 points
The C48 regression test uses a fractional grid. It runs quickly, but there is a very high percentage of uninterpolated points:
in fixrdc for mon= 3 fngrib= /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/UFS_UTILS/reg_tests/global_cycle/../../fix/am/global_soilmgldas.statsgo.t94.192.96.grb
unable to interpolate. filled with nearest point value at 1308 points
Maybe there is a problem with how the interpolation mask is being set up for fractional grids?
Could you provide guidance on setting up the interpolation mask correctly for fractional grids? Also, as we're going to run 3-week analysis-forecast cycles, I'm curious about the potential impact of using non-fractional grids instead of fractional grids.
I think the mask problem is a bug in the global_cycle code. I will need to run some tests.
@spanNOAA - I found the problem and have a fix. What hash of the ccpp-physics are you using?
I checked the CMakeLists.txt file located at sorc/ufs_utils.fd/ccpp-physics, and it shows that the ccpp version is 5.0.0.
For sorc/ufs_utils.fd/ccpp-physics, the hash is:
commit 3a306a493a9a0b6c3c39c7b50d356f0ddb7c5c94 (HEAD)
Merge: eda81a58 17c73687
Author: Grant Firl <grant.firl@noaa.gov>
Date: Tue May 9 13:14:47 2023 -0400
Merge pull request #65 from Qingfu-Liu/update_HR2
PBL and Convection and Microphysics update for HR2
For sorc/ufs_model.fd/FV3/ccpp/physics, the hash is:
commit 9b0ac7b16a45afe5e7f1abf9571d3484158a5b43 (HEAD, origin/ufs/dev, origin/HEAD, ufs/dev)
Merge: 98396808 7fa55935
Author: Grant Firl <grant.firl@noaa.gov>
Date: Wed Mar 27 11:26:20 2024 -0400
Merge pull request #184 from lisa-bengtsson/cloudPR
Introduce namelist flag xr_cnvcld to control if suspended grid-mean convective cloud condensate should be included in cloud fraction and optical depth calculation in the GFS suite
I have a fix. Replace the version of sfcsub.F in /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs/sorc/ufs_utils.fd/ccpp-physics/physics with the version here: /scratch1/NCEPDEV/da/George.Gayno/cycle.broke
Then, recompile ufs_utils. It should run now with only six MPI tasks, one task per tile.
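(A minimal sketch of applying that fix, assuming the g-w checkout noted above, that the patched sfcsub.F sits directly in the cycle.broke directory, and that sorc/build_ufs_utils.sh is the ufs_utils build script in this checkout; adjust paths as needed:)
user@host:$ cd /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs
user@host:$ cp /scratch1/NCEPDEV/da/George.Gayno/cycle.broke/sfcsub.F sorc/ufs_utils.fd/ccpp-physics/physics/sfcsub.F
user@host:$ cd sorc
user@host:$ ./build_ufs_utils.sh             # rebuild ufs_utils with the patched sfcsub.F
user@host:$ ./link_workflow.sh               # relink the rebuilt executables if your setup uses linked copies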
The fix successfully resolves the issues for both gdassfcanl and gfssfcanl; both tasks now complete without any problems. Another issue concerns the gdasanalcalc task, which also becomes stuck at a particular point until it exceeds the wallclock limit. Could you please investigate this problem as well?
I examined the C768 gdasanalcalc failure on Hera, with the following findings.
Job gdasanalcalc copies interp_inc.x to chgres_inc.x. The g-w script ush/calcanl_gfs.py executes chgres_inc.x. As reported in this issue, chgres_inc.x hangs on Hera when processing C768 files.
I was able to reproduce this behavior in a stand-alone shell script that executes interp_inc.x. Script test_gw.sh in /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi-utils uses the same job configuration as gdasanalcalc. interp_inc.x hangs and the job runs until the specified job wall time is reached.
Script test.sh (same directory) alters the job configuration and interp_inc.x runs to completion.
test_gw.sh specifies
#SBATCH --nodes=4
#SBATCH --tasks-per-node=40
whereas test.sh specifies
#SBATCH --nodes=1
#SBATCH --tasks-per-node=10
Both scripts execute interp_inc.x as srun -l -n 10 --verbose --export=ALL -c 1 $interpexe. The only difference is in the indicated SBATCH lines.
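(For concreteness, a minimal sketch of the configuration that ran to completion, drawn from the lines above; the script header, walltime, account/queue options, and $interpexe path are illustrative:)
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=10
#SBATCH -t 00:30:00
# add -A/-q options as required on your system
interpexe=/path/to/interp_inc.x              # illustrative path to the interp_inc.x executable
srun -l -n 10 --verbose --export=ALL -c 1 $interpexe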
calcanl_gfs.py executes chgres_inc.x. gdasanalcalc runs chgres_inc.x with
srun -n 10 --verbose --export=ALL -c 1 --distribution=arbitrary --cpu-bind=cores
The parallel xml specifies <nodes>4:ppn=40:tpp=1</nodes> for the gfs and gdas analcalc jobs.
The analcalc job runs several executables: interp_inc.x runs 10 tasks, calc_anal.x runs 127 tasks, and gaussian_sfcanl.x runs 1 task. This is why the xml for analcalc specifies 4 nodes with 40 tasks per node.
I do not have a solution for the Hera hang in gdasanalcalc at C768. I am simply sharing what tests reveal.
Hi @RussTreadon-NOAA, just following up on the Hera hang issue in gdasanalcalc at C768 that we discussed about a month ago. You mentioned that there wasn't a solution available at that time and shared some test results.
I wanted to check in to see if there have been any updates or progress on resolving this issue since then.
@spanNOAA , no updates from me. I am not actively working on this issue.
@SamuelTrahanNOAA Could you take a look at this issue? Thanks!
I am looking into this. Presently, I am not able to cycle past the first half-cycle due to OOM errors, so that will need to be resolved first.
I do not have a solution for this yet, either, but I do have some additional details. The hang occurs at line 390 of driver.F90. The mpi_send is successful in that the corresponding mpi_recv at line 413 is able to pick up the data and continue processing, but it then stops at line 422 waiting for the next mpi_send at line 392, which never comes. It is not clear to me why the mpi_send at line 390 does not return after sending the data.
The issue appears to be the size of the buffer that is sent via mpi_send and may reflect a bug in MPI, though I am not certain of that, based on this discussion. I have found a workaround and have a draft PR in place (https://github.com/NOAA-EMC/GSI-utils/pull/49). This needs to be tested for reproducibility. Is that something you could help with @guoqing-noaa?
@DavidHuber-NOAA Thanks a lot for the help. We will test your PR #49 and update you on how it goes.
@guoqing-noaa I have opened PR #2819. The branch has other C768 fixes in it that will be helpful for testing. I had another problem with the analysis UPP job, so this is still a work in progress.
Thanks, @DavidHuber-NOAA
@DavidHuber-NOAA I have no issues with the C768 gdasanalcalc task after applying this fix.
What is wrong?
The gdassfcanl, gfssfcanl, and gdasanalcalc tasks fail starting from the second cycle. Regardless of the wallclock limit set for the job, the tasks consistently exceed the time limit.
I am attempting to run the simulations starting from 2023021018 and ending 202302261800.
Brief snippet of error from the gdassfcanl.log and gfssfcanl.log files for the 2023021100 forecast cycle:
0: update OUTPUT SFC DATA TO: ./fnbgso.001
0:
0: CYCLE PROGRAM COMPLETED NORMALLY ON RANK: 0
0: slurmstepd: error: STEP 58349057.0 ON h34m13 CANCELLED AT 2024-04-16T21:54:15 DUE TO TIME LIMIT
slurmstepd: error: JOB 58349057 ON h34m13 CANCELLED AT 2024-04-16T21:54:15 DUE TO TIME LIMIT
Start Epilog on node h34m13 for job 58349057 :: Tue Apr 16 21:54:17 UTC 2024 Job 58349057 finished for user Sijie.Pan in partition hera with exit code 0:0
End Epilogue Tue Apr 16 21:54:17 UTC 2024
Brief snippet of error from gdasanalcalc.log file for 2023021100 forecast cycle:
PROGRAM INTERP_INC HAS BEGUN. COMPILED 2019100.00 ORG: EMC
STARTING DATE-TIME APR 15,2024 17:16:27.299 106 MON 2460416
Start Epilog on node h1m01 for job 58250207 :: Mon Apr 15 17:36:18 UTC 2024 Job 58250207 finished for user Sijie.Pan in partition bigmem with exit code 0:0
End Epilogue Mon Apr 15 17:36:18 UTC 2024
What should have happened?
The tasks 'gdassfcanl', 'gfssfcanl', and 'gdasanalcalc' should generate the respective files required by the remainder of the workflow.
What machines are impacted?
Hera
Steps to reproduce
Additional information
You can find gdassfcanl.log, gfssfcanl.log and gdasanalcalc.log in the following directory: /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/comroot/C768_6hourly_0210/logs/2023021100
Do you have a proposed solution?
No response