Closed WalterKolczynski-NOAA closed 8 months ago
@WalterKolczynski-NOAA , gfsanal completes but gdasanal does not. Looking at the log files, gdasanal has finished two outer loops. It is writing obs-anl information to diagnostic files via ncdiag. gsi.x appears to be hanging during this process.
I copied the C96C48_hybatmDA setup to my directory and submitted gdasanal with the obs-anl diagnostics turned off via the addition of
##TEST
export SETUP='write_diag(3)=.false.,'
##TEST
to config.anal. gdasanal ran to completion with a total elapsed time of 01:19:51 (hh:mm:ss). If I turn the obs-anl diagnostics back on (i.e., revert to write_diag(3)=.true.), I expect gsi.x will hang.
@DavidHuber-NOAA found some memory issues with ncdiag. Not sure if what he found is related to the gsi.x hangs in gdasanal.
The GSI hash cloned with g-w dates back to 9/25/2023. When did gdasanal start failing due to exceeding the wall clock limit? Knowing when might point to system and/or library changes.
@RussTreadon-NOAA
As you noted, the GSI hash was last updated 9/25/2023. There were no test timeouts in the GSI analysis steps until last week.
This comment from a PR last week (10/24/2023) shows no timeouts in the GSI analysis jobs on Orion.
The weekly test on 10/23/2023 also had no issues. See this comment.
These timeouts started after 10/24/2023. Incidentally, Orion had PM on 10/23/2023.
Thank you, @aerorahul , for the timeline. Someone should reach out to the Orion admins to see if any system changes were implemented during the 10/23 Orion PM.
Done @RussTreadon-NOAA
Rerun gdasanal on Orion with the number of inner loop iterations reduced from {100, 150} to {2, 4}. Job ran to completion with a gsi.x wall time of 2207.067311 seconds. This is an extremely high wall time for such a short analysis. My previous comment about gsi.x hanging while writing the obs-anl netcdf diagnostics is not correct. gsi.x is truly running much slower on Orion. Why?
Thank you @RussTreadon-NOAA
I have opened a ticket w/ Orion and have included you.
Would you be willing to include this information on that email thread for the Orion system admins information?
The above test ran gsi.x with 200 mpi tasks. Drop this back to 84 mpi tasks, ppn=8, threads=5 and rerun gdasanal. The job ran to completion with a gsi.x wall time of 3012.001859 seconds for a {2,4} inner loop setup. Post this result in the Orion ticket email thread.
@WalterKolczynski-NOAA , my comment that gsi.x hangs on Orion is not correct. gsi.x runs to completion ... it just takes a very, very long time to complete. Your original title for this issue, "GSI-based analysis jobs dying after hitting wall clock limit on Orion", is more accurate.
Thanks @RussTreadon-NOAA, changed it back
Notice the following in ORION.env
# Configure MPI environment
export MPI_BUFS_PER_PROC=2048
export MPI_BUFS_PER_HOST=2048
export MPI_GROUP_MAX=256
export MPI_MEMMAP_OFF=1
export MP_STDOUTMODE="ORDERED"
export KMP_AFFINITY=scatter
export OMP_STACKSIZE=2048000
export NTHSTACK=1024000000
#export LD_BIND_NOW=1
Are these settings necessary on Orion?
A few other questions:
- Have any developers reported g-w slowdowns on Orion following the 10/23 PM?
- What do the timings for other jobs in the Orion CI suite look like? Faster, slower, or comparable to pre-10/23 timings?
- What run times do we see on Hera for C96C48_hybatmDA?
- Who is actively working on this issue?
I didn't add the ORION.env settings, so I can't say whether they are necessary. I don't know the exact timings, but none of the other workflow jobs are timing out. After nearly doubling the wallclock didn't work, I've been waiting for the helpdesk to respond.
So enkfgdaseobs and enkfgdaseupd run to completion on Orion without any problem? enkfgdaseobs runs gsi.x in observer mode at ensemble resolution. enkfgdaseupd runs enkf.x.
If we don't know why the MPI, MP, and KMP variables are included in ORION.env, we should consider removing them.
Is anyone working with the Orion helpdesk on this issue?
Thank you @WalterKolczynski-NOAA and @aerorahul for the updates. Hopefully the Orion helpdesk will offer suggestions for things we can try or, better yet, identify & fix whatever is causing the slowdown.
It is my understanding that JEDI GDAS validation will be performed on Orion. JEDI developers will be running gsi.x and fv3jedi_var.x. Hence my interest in seeing this issue resolved.
Adding @dtkleist and @CoryMartin-NOAA for awareness.
Run the following test on Orion.

GSI at ca19008b
This is the GSI hash currently checked out with g-w develop. This GSI hash builds and runs gsi.x using intel/2022.1.2 and impi/2022.1.2 via hpc/1.2.0. Run the g-w C96C48_hybatmDA case. This is a C96L127 4denvar analysis with a two member C48L127 ensemble valid for 2021122100. gsi.x was run with 84 mpi tasks, 8 tasks per node, and 5 threads per task. The job ran in /work2/noaa/stmp/rtreadon/RUNDIRS/C96C48_hybatmDA/gdasanal.develop.ca19008. It started at 2023-11-02T15:39:08. The intel/2022 gsi.x wall time was 5643.859513 seconds (94.1 minutes). The JobId for this run was 15597695.
GSI at a6515bd8
This is the GSI hash prior to the Intel 2022 update (GSI PR #571). This GSI hash builds and runs gsi.x using intel/2018.4 and impi/2018.4 via hpc/1.1.0. Run the same g-w C96C48_hybatmDA case with the same configuration. The job ran in /work2/noaa/stmp/rtreadon/RUNDIRS/C96C48_hybatmDA/gdasanal.develop.a6515bd8. It started at 2023-11-02T15:41:16 (about two minutes after the intel/2022 job). The intel/2018 gsi.x wall time was 3750.297334 seconds (62.5 minutes). The JobId for this run was 15597727.
Both wall times exceed the current wall clock limit of 40 minutes (2400 seconds) for the g-w C96C48_hybatmDA case. The intel/2018 gsi.x ran considerably faster than the intel/2022 build. Why?

I did a tail -f job_logfile as each job ran. It was clear by watching the scrolling runtime output that the intel/2018.4 gsi.x ran noticeably faster during the minimization. This part of the GSI is compute-heavy threaded code with many mpi_alltoall communications.

The above test is not a purely apples-to-apples test because the two GSI hashes have code differences in addition to compiler differences. However, ctests run prior to each GSI hash update have not shown any significant increase in GSI wall time due to source code updates. Furthermore, runs of the g-w C96C48_hybatmDA case used the same GSI hash before and after the 10/23 Orion PM. The g-w team did not report long running gsi.x prior to the 10/23 Orion PM.
The Orion helpdesk has shared the following regarding Orion updates
What modules/compiler was gsi.x built against? We've been seeing issues with some mpi combinations related to the slurm upgrade, and it's possible gsi.x could be defaulting to run in serial instead of mpi.
the slurm upgrade meant we had to recompile mpich and openmpi, but we haven't seen issues with the intel mpi stack.
Questions:
- Why does the intel/2022 gsi.x build appear to be more sensitive to the slurm upgrade than the intel/2018 build?
- Do we need to recompile hpc/1.1.0 and hpc/1.2.0 on Orion?
- If so, do we also need to rebuild gsi.x?
The extremely high gsi.x wall times currently experienced on Orion reduce the usefulness of this machine for work involving gsi.x.
@DavidHuber-NOAA , have you noticed a slowdown (ie, increased wall time) in any apps on Orion since the 10/23 PM?
For comparison, the g-w C96C48_hybatmDA test runs gsi.x on Hera with a wall time of 1046.653272 seconds (17.44 minutes). The job configuration is the same as on Orion: gsi.x is run with 84 mpi tasks, 8 tasks per node, 5 threads per task.
@RussTreadon-NOAA I have not worked with Orion very much since the 10/23 PM. The only app that I have installed/tested there is ufs_utils, though it is building with Intel 2022 and spack-stack libraries. Two executables, chgres_cube and cpld_gridgen, required additional wall time to complete at the highest resolution of each (C768 and 025, respectively; increased wall clock from 15 to 20 minutes). However, I saw similar slowdowns on Jet, so it may be a library optimization issue rather than a system one.
I will be performing additional GSI and global workflow testing on Orion shortly after a new install of spack-stack/1.5.1 goes through (likely Monday). I will keep this issue updated.
Thank you @DavidHuber-NOAA
If we need to recompile hpc/1.2.0, who can / will do this? I'm willing to do the work. I need someone to point me at instructions. We should resolve this issue sooner rather than later.
After discussion with Rahul & Cory, an approach better aligned with other efforts will be pursued. Updates will be provided as work progresses.
@RussTreadon-NOAA , @AlexanderRichert-NOAA installed a spack-stack on Orion at /work/noaa/nems/arichert/spack-stack-1.4.1-gw/envs/gw/install/modulefiles/Core. @DavidHuber-NOAA is using a similar installation on Hera for his work on migrating to spack-stack installed libraries and dependencies.
Interesting finding on Orion. Script exglobal_atmos_analysis.sh executes gsi.x as
${NCP} ${GSIEXEC} ${DATA}
${APRUN_GSI} ${DATA}/$(basename ${GSIEXEC}) 1>&1 2>&2
export err=$?; err_chk
As a result, run time output is written to the job log file.
Run gsi.x in observer mode with a C768L128 background without FGAT (only read the atmf006 background). The gsi.x wall time when writing run time output to the log file was 937.937032 seconds.
Repeat this test with one change. Modify the APRUN_GSI line such that run time output is written to local file stdout:
${NCP} ${GSIEXEC} ${DATA}
${APRUN_GSI} ${DATA}/$(basename ${GSIEXEC}) > stdout 2>&1
export err=$?; err_chk
With this configuration the gsi.x wall time dropped to 587.806638 seconds. This is a difference of 350.130394 seconds, nearly 6 minutes.
This behavior was replicated with a stand-alone GSI run script on Orion. With $APRUN_GSI ./gsi.x, the gsi.x wall time was 1289.723781 seconds. With $APRUN_GSI ./gsi.x > stdout 2>&1, the gsi.x wall time was 513.412631 seconds.
Why does the gsi.x wall time vary so much by simply changing the destination of run time output?
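A toy illustration of the two cases above, with a stand-in for gsi.x's minimization printout (the function name and line count here are invented for this sketch; no timing claim is made):

```shell
#!/bin/sh
# 'fake_minimization' stands in for gsi.x's chatty iteration output.
fake_minimization() {
    i=0
    while [ "$i" -lt 1000 ]; do
        echo "iter ${i} cost function value ..."
        i=$((i + 1))
    done
}

# Slow case observed above: stdout streams back through the launcher to
# the shared job log, one line at a time:
#   ${APRUN_GSI} ./gsi.x
# Fast case observed above: stdout is redirected to a node-local file:
#   ${APRUN_GSI} ./gsi.x > stdout 2>&1
fake_minimization > stdout 2>&1
wc -l < stdout
```

The only difference between the two observed runs is where the per-iteration output lands, which is what makes the 2x wall time gap so surprising.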
I cannot access /work/noaa/nems/arichert/spack-stack-1.4.1-gw/envs/gw/install/modulefiles/Core due to permission restrictions:
Orion-login-4:/work2/noaa/da/rtreadon/git/global-workflow/pr1997/sorc/gsi_enkf.fd$ ls -l /work/noaa/nems/arichert/ |grep spack-stack-1.4.1-gw
drwxr-s--- 9 arichert nems 4096 Nov 3 17:06 spack-stack-1.4.1-gw
I do not belong to the Orion nems group.
@AlexanderRichert-NOAA , when you have a chance would you change the permissions on /work/noaa/nems/arichert/spack-stack-1.4.1-gw? I do not belong to the nems group, so I cannot access your spack-stack-1.4.1-gw installation.
@RussTreadon-NOAA please try again
@AlexanderRichert-NOAA , I can now access /work/noaa/nems/arichert/spack-stack-1.4.1-gw/envs/gw/install/modulefiles/Core. Thank you for changing the directory permissions.
Unfortunately, I cannot compile the GSI with this stack.
I used @DavidHuber-NOAA 's gsi_orion.lua as a guide to modify gsi_orion.lua from a snapshot of the authoritative GSI develop. I used your spack-stack-1.4.1-gw in prepend_path.
The load of gsi_orion.lua fails because wrf_io/1.2.0 and sfcio/1.4.1 are not found. The only wrf module I see with spack-stack-1.4.1-gw is FV3-wrfruc/1.0. This won't work in the GSI. module spider sfcio does not find any sfcio module.
I just added sfcio/1.4.1 and wrf-io/1.2.0 to the test environment, can you try again?
gsi.x and enkf.x compiled on Orion. gdasanal submitted. gsi.x is running. Based on the speed at which the minimization printout is being written to the job log file, gsi.x does not appear to be running much faster. I'll let gdasanal run to completion to get the measured gsi.x wall time.
Looking at the gsi.x build log, I only see references to /work/noaa/nems/arichert/spack-stack-1.4.1-gw/envs/gw/install/intel/2022.0.2 for gsi-ncdiag-1.1.2. Other modules are being picked up from /work/noaa/epic/role-epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2022.0.2. The build log is file build.log in /work2/noaa/da/rtreadon/git/global-workflow/pr1997/sorc/gsi_enkf.fd/ush.
The modified gsi_orion.lua used in the gsi.x build is in /work2/noaa/da/rtreadon/git/global-workflow/pr1997/sorc/gsi_enkf.fd/modulefiles.
@AlexanderRichert-NOAA and @DavidHuber-NOAA , did I correctly modify this file? I also had to make a few changes in gsi_common.lua to pick up the appropriate version numbers and module names.
@RussTreadon-NOAA There are a couple of changes needed in gsi_common.lua:
netCDF is now loaded through two modules (netcdf-c and netcdf-fortran), and their versions are 4.9.2 and 4.6.0. As is, the GSI is building against netCDF 4.7.4. So replace
local netcdf_ver=os.getenv("netcdf_c_ver") or "4.9.2"
and
load(pathJoin("netcdf", netcdf_ver))
with
local netcdf_c_ver=os.getenv("netcdf_c_ver") or "4.9.2"
local netcdf_fortran_ver=os.getenv("netcdf_fortran_ver") or "4.6.0"
and
load(pathJoin("netcdf-c", netcdf_c_ver))
load(pathJoin("netcdf-fortran", netcdf_fortran_ver))
Otherwise, I think it is correct. And looking at your build log, I think it is correct that it is using the unified-env environment for everything except ncdiag. The gw environment is built on top of the unified-env environment so it does not have to rebuild the entire library suite, just gsi-ncdiag.
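The overlay relationship described above can be sketched in terms of module search order. Both install prefixes are taken from this thread, but the exact modulefiles layout of the unified-env tree is an assumption for illustration:

```shell
#!/bin/sh
# The gw tree is searched first; any module it does not provide
# (everything except gsi-ncdiag) falls through to the unified-env tree.
gw=/work/noaa/nems/arichert/spack-stack-1.4.1-gw/envs/gw/install/modulefiles/Core
unified=/work/noaa/epic/role-epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/modulefiles/Core

# Prepending (as gsi_orion.lua's prepend_path does) yields this order:
MODULEPATH="${gw}:${unified}"
echo "${MODULEPATH%%:*}"    # first directory searched is the gw overlay
```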
Thank you, @DavidHuber-NOAA . I updated my gsi_common.lua and will rebuild gsi.x.
The rebuilt gsi.x runs just as slowly as it did with the previous spack-stack build.
@WalterKolczynski-NOAA, @aerorahul , or @TerrenceMcGuinness-NOAA , what range of gsi.x wall times do we see when running C96C48_hybatmDA on Hera? I would expect the Orion gsi.x time for C96C48_hybatmDA to be comparable.
The gsi.x wall time on Orion for the C96C48_hybatmDA test was 5566.158695 seconds. This gsi.x was built using /work/noaa/nems/arichert/spack-stack-1.4.1-gw/envs/gw/install/modulefiles/Core. 92.8 minutes to run a C96L127 analysis with a two member C48L127 ensemble is not acceptable.
Do we need to recompile every library in spack-stack-1.4.1? Do we need to add, edit, or remove environment variables set by g-w when running gsi.x on Orion?
@RussTreadon-NOAA The new spack-stack/1.5.1 libraries are now available. I will try running the RTs against these today. Can you try switching to those as well? The needed updates to the GSI modulefiles can be found here. Let me know if you need help with the global-workflow modulefiles.
Thank you @DavidHuber-NOAA for the suggestion. I updated my working copy of gsi_orion.lua and gsi_common.lua based on what you have in DavidHuber-NOAA:spack-stack. gsi.x was rebuilt and the 2021122100 gdasanal resubmitted. The gsi.x wall time was 6173.563277 seconds.
I rsync'd the files needed to run the C96C48_hybatmDA case to Hera, set up EXPDIR and HOMEgfs, and ran the 2021122100 gdasanal. The Hera gsi.x wall time was 1029.501289 seconds.
The Orion and Hera runs use the same task (84), ppn (8), and thread count (5) for gsi.x. The GSI code for each machine is identical: authoritative develop at a3e13da. Despite using the same code, data, and job configuration, the Orion wall time is 6x what we get on Hera.
@WalterKolczynski-NOAA , @aerorahul , and @TerrenceMcGuinness-NOAA : is my Hera gsi.x wall time for the C96C48_hybatmDA case comparable to what you see in g-w CI testing?
As mentioned before, the gsi.x slowdown on Orion is most evident when GSI execution is in the minimization. This part of the code is heavily threaded and contains many collective mpi calls with large buffers. Do we need to add, remove, or edit environment variables when running on Orion in light of the slurm upgrade?
Per the email thread with the Orion helpdesk, it looks like the solution is to either update the launcher command in ORION.env to explicitly set --cpus-per-task or define the variable SRUN_CPUS_PER_TASK. The value from the scheduler is no longer automatically applied.
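The two options can be sketched as follows. The launcher string and task/thread counts here are assumptions modeled on the ORION.env excerpts and job card elsewhere in this thread, not the actual file contents:

```shell
#!/bin/sh
# Under Slurm >= 22.05, srun no longer inherits --cpus-per-task from the
# sbatch job card, so threads silently collapse onto fewer CPUs per task
# unless the value is supplied again at srun time.
launcher="srun -l --export=ALL"   # assumed launcher string
npe_anal=84
nth_anal=5
NTHREADS_GSI=${nth_anal}

# Option 1: pass the thread count explicitly on the srun command line.
APRUN_GSI="${launcher} -n ${npe_anal} --cpus-per-task ${NTHREADS_GSI}"

# Option 2: export the environment variable srun consults in place of
# the no-longer-inherited sbatch value.
export SRUN_CPUS_PER_TASK=${NTHREADS_GSI}

echo "${APRUN_GSI}"
```

Option 1 is the fix ultimately applied to env/ORION.env in the diff below in this thread; Option 2 achieves the same effect without touching the launcher string.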
Orion Helpdesk staff identified the problem and provided a fix
Per the slurm release page:
"NOTE: Beginning with 22.05, srun will not inherit the --cpus-per-task value requested by salloc or sbatch. It must be requested again with the call to srun or set with the SRUN_CPUS_PER_TASK environment variable if desired for the task(s). "
gdasanal.sh was submitted by g-w with the following job card
#! /bin/sh
#SBATCH --job-name=C96C48_hybatmDA_gdasanal_00
#SBATCH --account=da-cpu
#SBATCH --qos=batch
#SBATCH --partition=orion
#SBATCH -t 03:00:00
#SBATCH --nodes=11-11
#SBATCH --tasks-per-node=8
#SBATCH --cpus-per-task=5
#SBATCH -o /work2/noaa/stmp/rtreadon/comrot/C96C48_hybatmDA/logs/2021122100/gdasanal.log
#SBATCH --export=NONE
#SBATCH --comment=b00bac22a2f20b292a6128c321f140c5
export RUN_ENVIR='emc'
export HOMEgfs='/work2/noaa/da/rtreadon/git/global-workflow/pr1997'
export EXPDIR='/work2/noaa/da/rtreadon/para_gfs/C96C48_hybatmDA'
export NET='gfs'
export CDUMP='gdas'
export RUN='gdas'
export CDATE='2021122100'
export PDY='20211221'
export cyc='00'
export COMROOT='/work/noaa/global/glopara/com'
export DATAROOT='/work2/noaa/stmp/rtreadon/RUNDIRS/C96C48_hybatmDA'
/work2/noaa/da/rtreadon/git/global-workflow/pr1997/jobs/rocoto/anal.sh
Given the Orion Helpdesk input, I added --cpus-per-task to the srun command used to run gsi.x. This was done by modifying env/ORION.env as follows:
Orion-login-4:/work2/noaa/da/rtreadon/git/global-workflow/pr1997$ git diff env/ORION.env
diff --git a/env/ORION.env b/env/ORION.env
index 400963d3..7a1cd63b 100755
--- a/env/ORION.env
+++ b/env/ORION.env
@@ -132,7 +132,7 @@ elif [[ "${step}" = "anal" ]] || [[ "${step}" = "analcalc" ]]; then
export NTHREADS_GSI=${nth_anal:-${nth_max}}
[[ ${NTHREADS_GSI} -gt ${nth_max} ]] && export NTHREADS_GSI=${nth_max}
- export APRUN_GSI="${launcher} -n ${npe_gsi:-${npe_anal}}"
+ export APRUN_GSI="${launcher} -n ${npe_gsi:-${npe_anal}} --cpus-per-task ${NTHREADS_GSI}"
export NTHREADS_CALCINC=${nth_calcinc:-1}
[[ ${NTHREADS_CALCINC} -gt ${nth_max} ]] && export NTHREADS_CALCINC=${nth_max}
The C96C48_hybatmDA gdasanal was resubmitted. gsi.x ran to completion with a wall time of 1050.672364 seconds. This is consistent with Hera.
We should use this opportunity to also review and remove/edit the following settings in env/ORION.env:
# Configure MPI environment
export MPI_BUFS_PER_PROC=2048
export MPI_BUFS_PER_HOST=2048
export MPI_GROUP_MAX=256
export MPI_MEMMAP_OFF=1
export MP_STDOUTMODE="ORDERED"
export KMP_AFFINITY=scatter
export OMP_STACKSIZE=2048000
export NTHSTACK=1024000000
What is wrong?
The GSI-based analysis jobs are running out of walltime before completing on Orion.
What should have happened?
Analysis jobs should complete
What machines are impacted?
Orion
Steps to reproduce
Additional information
Unsure what conditions changed on Orion to cause this issue.
Do you have a proposed solution?
No response