NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

GSI-based analysis jobs dying after hitting wallclock limit on Orion #1996

Closed: WalterKolczynski-NOAA closed this issue 8 months ago

WalterKolczynski-NOAA commented 9 months ago

What is wrong?

The GSI-based analysis jobs are running out of walltime before completing on Orion.

What should have happened?

Analysis jobs should complete

What machines are impacted?

Orion

Steps to reproduce

  1. Set up a cycled experiment on Orion
  2. Observe workflow failure for analysis jobs

Additional information

Unsure what conditions changed on Orion to cause this issue.

Do you have a proposed solution?

No response

RussTreadon-NOAA commented 9 months ago

@WalterKolczynski-NOAA , gfsanal completes but gdasanal does not. Looking at the log files, gdasanal has finished two outer loops. It is writing obs-anl information to diagnostic files via ncdiag. gsi.x appears to be hanging during this process.

I copied the C96C48_hybatmDA setup to my directory and submitted gdasanal with the obs-anl diagnostics turned off by adding

##TEST
export SETUP='write_diag(3)=.false.,'
##TEST

to config.anal. gdasanal ran to completion with a total elapsed time of 01:19:51 (hh:mm:ss). If I turn the obs-anl diagnostics back on (i.e., revert to write_diag(3)=.true.), I expect gsi.x will hang.

@DavidHuber-NOAA found some memory issues with ncdiag. Not sure if what he found is related to the gsi.x hangs in gdasanal.

The GSI hash cloned with g-w dates back to 9/25/2023. When did gdasanal start failing due to exceeding the wall clock limit? Knowing when might point to system and/or library changes.

aerorahul commented 9 months ago

@RussTreadon-NOAA As you noted, the GSI hash was last updated 9/25/2023. There were no test timeouts in the GSI analysis steps until last week.
This comment from a PR last week (10/24/2023) shows no timeouts in the GSI analysis jobs on Orion. The weekly test on 10/23/2023 also had no issues; see this comment. These timeouts started after 10/24/2023. Incidentally, Orion had a planned maintenance (PM) on 10/23/2023.

RussTreadon-NOAA commented 9 months ago

Thank you, @aerorahul , for the timeline. Someone should reach out to the Orion admins to see if any system changes were implemented during the 10/23 Orion PM.

aerorahul commented 9 months ago

Done @RussTreadon-NOAA

RussTreadon-NOAA commented 9 months ago

I reran gdasanal on Orion with the number of inner loop iterations reduced from {100, 150} to {2, 4}. The job ran to completion with a gsi.x wall time of 2207.067311 seconds. This is an extremely long wall time for such a short analysis. My previous comment about gsi.x hanging while writing the obs-anl netcdf diagnostics was not correct. gsi.x is truly running much slower on Orion. Why?
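
For reference, the shortened inner loops were set the same way as the write_diag test above, by overriding the GSI namelist from config.anal. A minimal sketch, assuming the reduced counts map to the standard GSI &SETUP niter entries:

##TEST
export SETUP='niter(1)=2,niter(2)=4,'
##TEST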

aerorahul commented 9 months ago

Thank you @RussTreadon-NOAA. I have opened a ticket with Orion and have included you.
Would you be willing to add this information to that email thread for the Orion system admins' information?

RussTreadon-NOAA commented 9 months ago

The above test ran gsi.x with 200 mpi tasks. I dropped this back to 84 mpi tasks, ppn=8, threads=5 and reran gdasanal. The job ran to completion with a gsi.x wall time of 3012.001859 seconds for the {2, 4} inner loop setup. I posted this result in the Orion ticket email thread.
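
For reference, that layout amounts to something like the following srun invocation (a sketch, not the literal command the workflow builds from APRUN_GSI):

export OMP_NUM_THREADS=5
srun -n 84 --ntasks-per-node=8 --cpus-per-task=5 ./gsi.x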

RussTreadon-NOAA commented 9 months ago

@WalterKolczynski-NOAA , my comment that gsi.x hangs on Orion is not correct. gsi.x runs to completion ... it just takes a very, very long time to complete. Your original title for this issue, GSI-based analysis jobs dying after hitting wall clock limit on Orion, is more accurate.

WalterKolczynski-NOAA commented 9 months ago

Thanks @RussTreadon-NOAA, changed it back

RussTreadon-NOAA commented 9 months ago

Notice the following in ORION.env

# Configure MPI environment
export MPI_BUFS_PER_PROC=2048
export MPI_BUFS_PER_HOST=2048
export MPI_GROUP_MAX=256
export MPI_MEMMAP_OFF=1
export MP_STDOUTMODE="ORDERED"
export KMP_AFFINITY=scatter
export OMP_STACKSIZE=2048000
export NTHSTACK=1024000000
#export LD_BIND_NOW=1

Are these settings necessary on Orion?

A few other questions

  • Have any developers reported g-w slowdowns on Orion following the 10/23 PM?
  • What do the timings for other jobs in the Orion CI suite look like? Faster, slower, or comparable to pre-10/23 timings?
  • What run times do we see on Hera for C96C48_hybatmDA?
  • Who is actively working on this issue?

WalterKolczynski-NOAA commented 9 months ago


Notice the following in ORION.env

# Configure MPI environment
export MPI_BUFS_PER_PROC=2048
export MPI_BUFS_PER_HOST=2048
export MPI_GROUP_MAX=256
export MPI_MEMMAP_OFF=1
export MP_STDOUTMODE="ORDERED"
export KMP_AFFINITY=scatter
export OMP_STACKSIZE=2048000
export NTHSTACK=1024000000
#export LD_BIND_NOW=1

Are these settings necessary on Orion?

I didn't add them, so I can't say.

A few other questions

  • Have any developers reported g-w slowdowns on Orion following the 10/23 PM?
  • What do the timings for other jobs in the Orion CI suite look like? Faster, slower, or comparable to pre-10/23 timings?

I don't know the exact timings, but none of the other workflow jobs are timing out.

  • What run times do we see on Hera for C96C48_hybatmDA?
  • Who is actively working on this issue?

After nearly doubling the wallclock didn't work, I've been waiting for the helpdesk to respond.

RussTreadon-NOAA commented 9 months ago

So enkfgdaseobs and enkfgdaseupd run to completion on Orion without any problem? enkfgdaseobs runs gsi.x in observer mode at ensemble resolution. enkfgdaseupd runs enkf.x

If we don't know why the MPI, MP, and KMP variables are included in ORION.env, we should consider removing them.

Is anyone working with the Orion helpdesk on this issue?

aerorahul commented 9 months ago

Notice the following in ORION.env

# Configure MPI environment
export MPI_BUFS_PER_PROC=2048
export MPI_BUFS_PER_HOST=2048
export MPI_GROUP_MAX=256
export MPI_MEMMAP_OFF=1
export MP_STDOUTMODE="ORDERED"
export KMP_AFFINITY=scatter
export OMP_STACKSIZE=2048000
export NTHSTACK=1024000000
#export LD_BIND_NOW=1

Are these settings necessary on Orion?

A few other questions

  • Have any developers reported g-w slowdowns on Orion following the 10/23 PM?
  • What do the timings for other jobs in the Orion CI suite look like? Faster, slower, or comparable to pre-10/23 timings?
  • What run times do we see on Hera for C96C48_hybatmDA?
  • Who is actively working on this issue?

@RussTreadon-NOAA

RussTreadon-NOAA commented 9 months ago

Thank you @WalterKolczynski-NOAA and @aerorahul for the updates. Hopefully the Orion helpdesk will offer suggestions for things we can try or, better yet, identify & fix whatever is causing the slowdown.

It is my understanding that JEDI GDAS validation will be performed on Orion. JEDI developers will be running gsi.x and fv3jedi_var.x. Hence my interest in seeing this issue resolved.

Adding @dtkleist and @CoryMartin-NOAA for awareness.

RussTreadon-NOAA commented 9 months ago

I ran the following tests on Orion.

GSI at ca19008b: This is the GSI hash currently checked out with g-w develop. It builds and runs gsi.x using intel/2022.1.2 and impi/2022.1.2 via hpc/1.2.0. I ran the g-w C96C48_hybatmDA case: a C96L127 4denvar analysis with a two-member C48L127 ensemble valid for 2021122100. gsi.x was run with 84 mpi tasks, 8 tasks per node, and 5 threads per task. The job ran in /work2/noaa/stmp/rtreadon/RUNDIRS/C96C48_hybatmDA/gdasanal.develop.ca19008 and started at 2023-11-02T15:39:08. The intel/2022 gsi.x wall time was 5643.859513 seconds (94.1 minutes). The JobId for this run was 15597695.

GSI at a6515bd8: This is the GSI hash prior to the Intel 2022 update (GSI PR #571). It builds and runs gsi.x using intel/2018.4 and impi/2018.4 via hpc/1.1.0. I ran the same g-w C96C48_hybatmDA case with the same configuration. The job ran in /work2/noaa/stmp/rtreadon/RUNDIRS/C96C48_hybatmDA/gdasanal.develop.a6515bd8 and started at 2023-11-02T15:41:16 (about two minutes after the intel/2022 job). The intel/2018 gsi.x wall time was 3750.297334 seconds (62.5 minutes). The JobId for this run was 15597727.

Both wall times exceed the current wall clock limit of 40 minutes (2400 seconds) for the g-w C96C48_hybatmDA case. The intel/2018 gsi.x ran considerably faster than the intel/2022 build. Why?

I did a tail -f on each job log file as the jobs ran. It was clear from watching the scrolling runtime output that the intel/2018.4 gsi.x ran noticeably faster during the minimization. This part of the GSI is compute-heavy threaded code with many mpi_alltoall communications.

The above test is not a purely apples-to-apples comparison because the two GSI hashes have code differences in addition to compiler differences. However, ctests run prior to each GSI hash update have not shown any significant increase in GSI wall time due to source code updates. Furthermore, runs of the g-w C96C48_hybatmDA case used the same GSI hash before and after the 10/23 Orion PM. The g-w team did not report a long-running gsi.x prior to the 10/23 Orion PM.

The Orion helpdesk has shared the following regarding Orion updates

What modules/compiler was gsi.x built against? We've been seeing issues with some mpi combinations related to the slurm upgrade, and it's possible gsi.x could be defaulting to run in serial instead of mpi.

the slurm upgrade meant we had to recompile mpich and openmpi, but we haven't seen issues with the intel mpi stack.

Questions:

The extremely high gsi.x wall times currently experienced on Orion reduce the usefulness of this machine for work involving gsi.x.
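
One quick way to answer the helpdesk's build question (a sketch, not output from this case) is to inspect what the installed gsi.x links against:

# Intel MPI libraries showing up here would make a silent serial fallback unlikely
ldd ./gsi.x | grep -Ei 'mpi|netcdf'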

RussTreadon-NOAA commented 9 months ago

@DavidHuber-NOAA , have you noticed a slowdown (ie, increased wall time) in any apps on Orion since the 10/23 PM?

RussTreadon-NOAA commented 9 months ago

For comparison, the g-w C96C48_hybatmDA test runs gsi.x on Hera with a wall time of 1046.653272 seconds (17.44 minutes). The job configuration is the same as on Orion: gsi.x is run with 84 mpi tasks, 8 tasks per node, and 5 threads per task.

DavidHuber-NOAA commented 9 months ago

@RussTreadon-NOAA I have not worked with Orion very much since the 10/23 PM. The only app that I have installed/tested there is ufs_utils, though it is building with Intel 2022 and spack-stack libraries. Two executables, chgres_cube and cpld_gridgen, required additional wall time to complete at the highest resolution of each (C768 and 025, respectively; increased wall clock from 15 to 20 minutes). However, I saw similar slowdowns on Jet, so it may be a library optimization issue rather than a system one.

I will be performing additional GSI and global workflow testing on Orion shortly after a new install of spack-stack/1.5.1 goes through (likely Monday). I will keep this issue updated.

RussTreadon-NOAA commented 9 months ago

Thank you @DavidHuber-NOAA

RussTreadon-NOAA commented 9 months ago

If we need to recompile hpc/1.2.0, who can / will do this? I'm willing to do the work. I need someone to point me at instructions. We should resolve this issue sooner rather than later.

RussTreadon-NOAA commented 9 months ago

If we need to recompile hpc/1.2.0, who can / will do this? I'm willing to do the work. I need someone to point me at instructions. We should resolve this issue sooner rather than later.

After discussion with Rahul & Cory, an approach better aligned with other efforts will be pursued. Updates will be provided as work progresses.

aerorahul commented 9 months ago

@RussTreadon-NOAA: @AlexanderRichert-NOAA installed a spack-stack on Orion at /work/noaa/nems/arichert/spack-stack-1.4.1-gw/envs/gw/install/modulefiles/Core. @DavidHuber-NOAA is using a similar installation on Hera for his work on migrating to spack-stack-installed libraries and dependencies.

RussTreadon-NOAA commented 9 months ago

Interesting finding on Orion. Script exglobal_atmos_analysis.sh executes gsi.x as

${NCP} ${GSIEXEC} ${DATA}
${APRUN_GSI} ${DATA}/$(basename ${GSIEXEC}) 1>&1 2>&2
export err=$?; err_chk

As a result, run time output is written to the job log file.

I ran gsi.x in observer mode with a C768L128 background without FGAT (only reading the atmf006 background). The gsi.x wall time when writing run time output to the log file was 937.937032 seconds.

I repeated this test with one change: the APRUN_GSI line was modified so that run time output is written to the local file stdout

${NCP} ${GSIEXEC} ${DATA}
${APRUN_GSI} ${DATA}/$(basename ${GSIEXEC}) > stdout 2>&1
export err=$?; err_chk

With this configuration the gsi.x wall time dropped to 587.806638 seconds. This is a difference of 350.130394 seconds, nearly 6 minutes.

This behavior was replicated with a stand-alone GSI run script on Orion. With $APRUN_GSI ./gsi.x, the gsi.x wall time was 1289.723781 seconds. With $APRUN_GSI ./gsi.x > stdout 2>&1, the gsi.x wall time was 513.412631 seconds.

Why does the gsi.x wall time vary so much by simply changing the destination of run time output?
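
For anyone who wants to reproduce this, the comparison boils down to the following in the run directory (a sketch; APRUN_GSI as set by env/ORION.env):

time ${APRUN_GSI} ./gsi.x                  # runtime output streams to the job log (slow case)
time ${APRUN_GSI} ./gsi.x > stdout 2>&1    # runtime output written to a local file (fast case)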

RussTreadon-NOAA commented 9 months ago

/work/noaa/nems/arichert/spack-stack-1.4.1-gw/envs/gw/install/modulefiles/Core

I cannot access this directory due to permission restrictions:

Orion-login-4:/work2/noaa/da/rtreadon/git/global-workflow/pr1997/sorc/gsi_enkf.fd$ ls -l /work/noaa/nems/arichert/ |grep spack-stack-1.4.1-gw
drwxr-s---  9 arichert nems 4096 Nov  3 17:06 spack-stack-1.4.1-gw

I do not belong to the Orion nems group.

RussTreadon-NOAA commented 9 months ago

/work/noaa/nems/arichert/spack-stack-1.4.1-gw/envs/gw/install/modulefiles/Core

@AlexanderRichert-NOAA , when you have a chance, would you change the permissions on /work/noaa/nems/arichert/spack-stack-1.4.1-gw? I do not belong to the nems group, so I cannot access your spack-stack-1.4.1-gw.

AlexanderRichert-NOAA commented 9 months ago

@RussTreadon-NOAA please try again

RussTreadon-NOAA commented 9 months ago

@AlexanderRichert-NOAA , I can now access /work/noaa/nems/arichert/spack-stack-1.4.1-gw/envs/gw/install/modulefiles/Core. Thank you for changing directory permissions.

Unfortunately, I cannot compile the GSI with this stack.

I used @DavidHuber-NOAA's gsi_orion.lua as a guide to modify gsi_orion.lua from a snapshot of the authoritative GSI develop. I used your spack-stack-1.4.1-gw in the prepend_path.

The load of gsi_orion.lua fails because wrf_io/1.2.0 and sfcio/1.4.1 are not found. The only wrf module I see with spack-stack-1.4.1-gw is FV3-wrfruc/1.0. This won't work in gsi. module spider sfcio does not find any sfcio module.
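
For reference, the check boils down to something like the following (gw modulefile path as given above):

module use /work/noaa/nems/arichert/spack-stack-1.4.1-gw/envs/gw/install/modulefiles/Core
module spider sfcio
module spider wrf-io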

AlexanderRichert-NOAA commented 9 months ago

The load of gsi_orion.lua fails because wrf_io/1.2.0 and sfcio/1.4.1 are not found. The only wrf module I see with spack-stack-1.4.1-gw is FV3-wrfruc/1.0. This won't work in gsi. module spider sfcio does not find any sfcio module.

I just added sfcio/1.4.1 and wrf-io/1.2.0 to the test environment, can you try again?

RussTreadon-NOAA commented 9 months ago

The load of gsi_orion.lua fails because wrf_io/1.2.0 and sfcio/1.4.1 are not found. The only wrf module I see with spack-stack-1.4.1-gw is FV3-wrfruc/1.0. This won't work in gsi. module spider sfcio does not find any sfcio module.

I just added sfcio/1.4.1 and wrf-io/1.2.0 to the test environment, can you try again?

gsi.x and enkf.x compiled on Orion. gdasanal was submitted and gsi.x is running. Based on the speed at which the minimization printout is being written to the job log file, gsi.x does not appear to be running much faster. I'll let gdasanal run to completion to get the measured gsi.x wall time.

Looking at the gsi.x build log, I only see references to /work/noaa/nems/arichert/spack-stack-1.4.1-gw/envs/gw/install/intel/2022.0.2 for gsi-ncdiag-1.1.2. Other modules are being picked up from /work/noaa/epic/role-epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2022.0.2. The build log is the file build.log in /work2/noaa/da/rtreadon/git/global-workflow/pr1997/sorc/gsi_enkf.fd/ush.
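
A rough way to see which install tree each dependency came from is to pull the stack paths out of the build log, e.g. (a sketch):

grep -Eo '/work/noaa/[^ ]*spack-stack[^ ]*' build.log | sort -u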

The modified gsi_orion.lua used in the gsi.x build is in /work2/noaa/da/rtreadon/git/global-workflow/pr1997/sorc/gsi_enkf.fd/modulefiles.

@AlexanderRichert-NOAA and @DavidHuber-NOAA , did I correctly modify this file? I also had to make a few changes in gsi_common.lua to pick up the appropriate version numbers and module names.

DavidHuber-NOAA commented 9 months ago

@RussTreadon-NOAA There are a couple of changes needed in gsi_common.lua:

netCDF is now loaded through two modules (netcdf-c and netcdf-fortran), and their versions are 4.9.2 and 4.6.0, respectively. As is, the GSI is building against netCDF 4.7.4. So replace

local netcdf_ver=os.getenv("netcdf_c_ver") or "4.9.2"
load(pathJoin("netcdf", netcdf_ver))

with

local netcdf_c_ver=os.getenv("netcdf_c_ver") or "4.9.2"
local netcdf_fortran_ver=os.getenv("netcdf_fortran_ver") or "4.6.0"
load(pathJoin("netcdf-c", netcdf_c_ver))
load(pathJoin("netcdf-fortran", netcdf_fortran_ver))

Otherwise, I think it is correct. And looking at your build log, it is correct that it is using the unified-env environment for everything except ncdiag. The gw environment is built on top of the unified-env environment, so it does not have to rebuild the entire library suite, just gsi-ncdiag.
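
If it helps, a quick post-rebuild sanity check (a sketch; point it at wherever your rebuilt gsi.x lands, and assuming the netCDF libraries are linked dynamically):

# the linked library paths should now contain netcdf-c-4.9.2 and netcdf-fortran-4.6.0 rather than 4.7.4
ldd ./gsi.x | grep -i netcdf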

RussTreadon-NOAA commented 9 months ago

Thank you, @DavidHuber-NOAA . I updated my gsi_common.lua and will rebuild gsi.x.

RussTreadon-NOAA commented 9 months ago

Thank you, @DavidHuber-NOAA . I updated my gsi_common.lua and will rebuild gsi.x.

The rebuilt gsi.x is running just as slowly as it was with the previous spack-stack build.

@WalterKolczynski-NOAA, @aerorahul , or @TerrenceMcGuinness-NOAA , what range of gsi.x wall times do we see when running C96C48_hybatmDA on Hera? I would expect the Orion gsi.x time for C96C48_hybatmDA to be comparable.

RussTreadon-NOAA commented 9 months ago

The gsi.x wall time on Orion for the C96C48_hybatmDA test was 5566.158695 seconds. This gsi.x was built using /work/noaa/nems/arichert/spack-stack-1.4.1-gw/envs/gw/install/modulefiles/Core. 92.8 minutes to run a C96L127 analysis with a two-member C48L127 ensemble is not acceptable.

Do we need to recompile every library in spack-stack-1.4.1? Do we need to add, edit, or remove environment variables set by g-w when running gsi.x on Orion?

DavidHuber-NOAA commented 9 months ago

@RussTreadon-NOAA The new spack-stack/1.5.1 libraries are now available. I will try running the RTs against these today. Can you try switching to those as well? The needed updates to the GSI modulefiles can be found here. Let me know if you need help with the global-workflow modulefiles.

RussTreadon-NOAA commented 9 months ago

Thank you @DavidHuber-NOAA for the suggestion. I updated my working copy of gsi_orion.lua and gsi_common.lua based on what you have in DavidHuber-NOAA:spack-stack. gsi.x was rebuilt and the 2021122100 gdasanal resubmitted. The gsi.x wall time was 6173.563277 seconds.

I rsync'd the files needed to run the C96C48_hybatmDA case to Hera, set up EXPDIR and HOMEgfs, and ran the 2021122100 gdasanal. The Hera gsi.x wall time was 1029.501289 seconds.

The Orion and Hera runs use the same task count (84), ppn (8), and thread count (5) for gsi.x. The GSI code for each machine is identical: authoritative develop at a3e13da. Despite using the same code, data, and job configuration, the Orion wall time is 6x what we get on Hera.

@WalterKolczynski-NOAA , @aerorahul , and @TerrenceMcGuinness-NOAA . Is my Hera gsi.x wall time for the C96C48_hybatmDA case comparable to what you see in g-w CI testing?

As mentioned before, the gsi.x slowdown on Orion is most evident when GSI execution is in the minimization. This part of the code is heavily threaded and contains many collective mpi calls with large buffers. Do we need to add, remove, or edit environment variables when running on Orion in light of the slurm upgrade?
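
One way to check the threading question directly (a sketch, not something captured from this run) is to look at the CPU binding each task actually receives inside the job allocation:

# from within the batch job: print each task's CPU affinity
# if every task reports a single CPU, the 5 OpenMP threads per task are being squeezed onto one core
srun -n 84 --ntasks-per-node=8 bash -c 'echo "rank ${SLURM_PROCID}: $(taskset -cp $$)"' | head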

WalterKolczynski-NOAA commented 9 months ago

Per the email thread with the Orion helpdesk, it looks like the solution is to either update the launcher command in ORION.env to explicitly set --cpus-per-task or define the variable SRUN_CPUS_PER_TASK. The value from the scheduler is no longer automatically applied.
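
For the environment-variable route, the change would look something like this in env/ORION.env (a sketch of the alternative, assuming SLURM_CPUS_PER_TASK is exported by the batch job; not what was ultimately committed):

# make every srun inherit the batch job's --cpus-per-task request (Slurm >= 22.05 no longer does this automatically)
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}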

RussTreadon-NOAA commented 9 months ago

Orion Helpdesk staff identified the problem and provided a fix

Per the slurm release page:

"NOTE: Beginning with 22.05, srun will not inherit the --cpus-per-task value requested by salloc or sbatch. It must be requested again with the call to srun or set with the SRUN_CPUS_PER_TASK environment variable if desired for the task(s). "

gdasanal.sh was submitted by g-w with the following job card

#! /bin/sh
#SBATCH --job-name=C96C48_hybatmDA_gdasanal_00
#SBATCH --account=da-cpu
#SBATCH --qos=batch
#SBATCH --partition=orion
#SBATCH -t 03:00:00
#SBATCH --nodes=11-11
#SBATCH --tasks-per-node=8
#SBATCH --cpus-per-task=5
#SBATCH -o /work2/noaa/stmp/rtreadon/comrot/C96C48_hybatmDA/logs/2021122100/gdasanal.log
#SBATCH --export=NONE
#SBATCH --comment=b00bac22a2f20b292a6128c321f140c5
export RUN_ENVIR='emc'
export HOMEgfs='/work2/noaa/da/rtreadon/git/global-workflow/pr1997'
export EXPDIR='/work2/noaa/da/rtreadon/para_gfs/C96C48_hybatmDA'
export NET='gfs'
export CDUMP='gdas'
export RUN='gdas'
export CDATE='2021122100'
export PDY='20211221'
export cyc='00'
export COMROOT='/work/noaa/global/glopara/com'
export DATAROOT='/work2/noaa/stmp/rtreadon/RUNDIRS/C96C48_hybatmDA'
/work2/noaa/da/rtreadon/git/global-workflow/pr1997/jobs/rocoto/anal.sh

Given the Orion Helpdesk input, I added --cpus-per-task to the srun command used to run gsi.x. This was done by modifying env/ORION.env as follows

Orion-login-4:/work2/noaa/da/rtreadon/git/global-workflow/pr1997$ git diff env/ORION.env
diff --git a/env/ORION.env b/env/ORION.env
index 400963d3..7a1cd63b 100755
--- a/env/ORION.env
+++ b/env/ORION.env
@@ -132,7 +132,7 @@ elif [[ "${step}" = "anal" ]] || [[ "${step}" = "analcalc" ]]; then

     export NTHREADS_GSI=${nth_anal:-${nth_max}}
     [[ ${NTHREADS_GSI} -gt ${nth_max} ]] && export NTHREADS_GSI=${nth_max}
-    export APRUN_GSI="${launcher} -n ${npe_gsi:-${npe_anal}}"
+    export APRUN_GSI="${launcher} -n ${npe_gsi:-${npe_anal}} --cpus-per-task ${NTHREADS_GSI}"

     export NTHREADS_CALCINC=${nth_calcinc:-1}
     [[ ${NTHREADS_CALCINC} -gt ${nth_max} ]] && export NTHREADS_CALCINC=${nth_max}

The C96C48_hybatmDA gdasanal was resubmitted. gsi.x ran to completion with a wall time of 1050.672364 seconds. This is consistent with Hera.

We should use this opportunity to also review and remove/edit the following settings in env/ORION.env

# Configure MPI environment
export MPI_BUFS_PER_PROC=2048
export MPI_BUFS_PER_HOST=2048
export MPI_GROUP_MAX=256
export MPI_MEMMAP_OFF=1
export MP_STDOUTMODE="ORDERED"
export KMP_AFFINITY=scatter
export OMP_STACKSIZE=2048000
export NTHSTACK=1024000000
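
For that review, here is a lightly annotated copy of the block; the notes on each variable's origin are my best guesses and are not verified on Orion:

# Configure MPI environment
export MPI_BUFS_PER_PROC=2048     # SGI MPT-era buffer setting; likely a no-op under Intel MPI (guess)
export MPI_BUFS_PER_HOST=2048     # SGI MPT-era buffer setting (guess)
export MPI_GROUP_MAX=256          # SGI MPT-era setting (guess)
export MPI_MEMMAP_OFF=1           # SGI MPT-era setting (guess)
export MP_STDOUTMODE="ORDERED"    # IBM POE setting; not used by srun/Intel MPI (guess)
export KMP_AFFINITY=scatter       # Intel OpenMP thread placement; does affect threaded gsi.x runs
export OMP_STACKSIZE=2048000      # standard OpenMP per-thread stack size; threaded GSI code can need a large value
export NTHSTACK=1024000000        # application stack setting carried over from other platforms (guess)
#export LD_BIND_NOW=1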