Closed jkhender closed 2 years ago
@GeorgeGayno-NOAA Likely need your inputs on adjusting resources for the gfsinit job, which runs chgres_cube via UFS_UTILS. Thanks!
I can take a look.
@KateFriedman-NOAA and @jkhender - I copied Judy's fort.41 file to my own space. I compiled 'develop', then kicked off chgres as a stand-alone job (not as part of the workflow). It worked for me with 4 nodes/6 tasks per node. Go to /scratch1/NCEPDEV/da/George.Gayno/noscrub/judy to see the script I used and log file. So I am not sure why it is not working for you. Kate, I will need to learn how the workflow runs chgres.
I didn't see any error when I tried last night either. Only thing I could think of was @jkhender is targeting a different resolution that requires more memory to do the interpolation.
@GeorgeGayno-NOAA @KateFriedman-NOAA - I'm running C768. I just copied what George did in my own directory and it ran fine for me standalone.
/scratch1/BMC/gsd-fv3-dev/Judy.K.Henderson/scratch/tmp/chgres_gg
@GeorgeGayno-NOAA @KateFriedman-NOAA I also ran another standalone pointing to my realtime executable and files and it also completed successfully. /scratch1/BMC/gsd-fv3-dev/Judy.K.Henderson/scratch/tmp/chgres_gg/ufsdev1
Is this still an issue @jkhender ? Is the standalone job working but not as part of workflow ?
@jkhender did you manage to solve this problem? I'm getting:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=28231120.0. Some of your processes may have been killed by the cgroup out-of-memory handler
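A side note on diagnosing these: Slurm accounting can show how close the killed step got to its memory cap. A sketch of the standard sacct query, using the step ID from the error above:

```shell
# Compare peak memory (MaxRSS) against requested memory (ReqMem)
# for the OOM-killed step.
sacct -j 28231120 --format=JobID,State,ExitCode,ReqMem,MaxRSS,MaxRSSNode
```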
in the post processing, wonder if it is related?
Yes, this is still an issue when running the init task with the global workflow. I can run standalone job fine. My forecast and post jobs also complete successfully with the workflow.
Ok, thanks for getting back to me
I don't know if this is related or not. When comparing my before/after maintenance log files, I see this difference under the ulimit -a output:
before maintenance
+ulimit -a
...
nice   (-e) 0
nofile (-n) 131072
nproc  (-u) 380571

after maintenance
ulimit -a
...
nice   (-e) 0
nofile (-n) 4096
nproc  (-u) 380571
before: /scratch1/BMC/gsd-fv3/rtruns/UFS-CAMsuite_dev1/FV3GFSrun/rt_ufscam_l127_dev1/logs/2022020100/gfsinit.log
after: /scratch1/BMC/gsd-fv3/rtruns/UFS-CAMsuite_dev1/FV3GFSrun/rt_ufscam_l127_dev1/logs/2022020300/gfsinit.log
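To spot which limit changed without eyeballing the whole dump, each limit can also be queried on its own (plain bash, nothing Hera-specific):

```shell
# Query individual limits rather than parsing the full `ulimit -a` dump.
ulimit -n   # open files (nofile): 131072 before maintenance, 4096 after
ulimit -u   # max user processes (nproc): unchanged at 380571
ulimit -s   # stack size: the limit the workflow raises to unlimited
```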
I checked Lisa's log files and see the same thing.
@jkhender Good catch! I know I had to add the following to my .bashrc to support running the UPP on Hera:
#For UPP
ulimit -S -s unlimited
@RaghuReddy-NOAA Was there a ulimit-related change on Hera during the Tuesday maintenance? Thanks!
Note, I haven't tried running anything on Hera this week, so I can't confirm whether that ulimit setting in my .bashrc resolves this OOM kill issue. I've had it in my .bashrc for a long time.
In env/HERA.env, the unlimited setting is in there.
ulimit -s unlimited
ulimit -a
In env/HERA.env, the unlimited setting is in there
Right! Forgot I put that in there. Thanks @jkhender for checking!
I have this in my log that failed:
grep ulimit gfspost001.log
And this in my log that ran:
grep ulimit gfspost001.log.5
@lisa-bengtsson I had checked logfiles gfspost002.log.0 and gfspost002.log.1 in your directory:
/scratch2/BMC/rem/Lisa.Bengtsson/ufs_gfsv16_prog_closure/comrot/long_progsigma/logs/2019101900
-rw-r--r-- 1 Lisa.Bengtsson rem 1204572 Feb 3 17:04 gfspost002.log.0
-rw-r--r-- 1 Lisa.Bengtsson rem 3233674 Jan 31 19:42 gfspost002.log.1
@jkhender I recompiled and ran, so both log.0 and log.1 are failing with the memory problem. From what I can see, the last one that worked is gfspost001.log.5. In that one I do not see execution of HERA.env. So perhaps something else changed?
gfspost001.log.2 seems to have finished fine as well, before the update.
We are looking at a few things. Please submit a Hera helpdesk ticket if you have not already done so.
Hi Kate,
If you are able to run the job outside of Rocoto successfully, we would like to see how much memory a successful run uses. Can you please add the following two commands to the end of your job file:
printenv
report-mem
It will be helpful to get the same info for the run under Rocoto too please.
For the run that fails in Rocoto it may not get to the end of job file, so the --epilog approach documented here may provide some additional information:
For the epilog approach, I guess you would have to modify your APRUN (or is it the LAUNCHER?) environment variable (I am not very familiar with the config files, but I know you have one that is typically set to "srun").
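A sketch of what that could look like, assuming the variable is APRUN and using a hypothetical epilog script path (srun's --epilog option runs the given executable after the step completes, including when it is killed):

```shell
# Hypothetical: attach an epilog so even an OOM-killed step leaves a
# memory report in the job log. The script path is a placeholder.
export APRUN="srun --epilog=/path/to/report_mem.sh"
```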
I added the memory information to the jobs I am running standalone. Logs are at
/scratch1/BMC/gsd-fv3-dev/Judy.K.Henderson/scratch/tmp/chgres_gg/fv3chem/log_0207
/scratch1/BMC/gsd-fv3-dev/Judy.K.Henderson/scratch/tmp/chgres_gg/ufsdev1/log_feb07
I will add the epilog info to the workflow tomorrow and run a test.
I added the epilog parameter to my srun command in my global workflow and the log file is at
/scratch1/BMC/gsd-fv3/rtruns/UFS-CAMsuite_dev1/FV3GFSrun/rt_ufscam_l127_dev1/logs/2022020700/gfsinit.log
@jkhender thank you for those runs! I just wanted to make sure those jobs were not close to the limit, and looks like there is plenty of room for the successful runs.
@jkhender Please try the following:
1) add these two settings to your config.resources:

to the init block:
export memory_init="100GB"

to the post block:
export memory_post="20GB"
2) rerun setup_workflow*.py to update the xml; you should now have these new MEMORY settings for the init and post jobs in your xml:
<!ENTITY MEMORY_INIT_GFS "100GB">
<!ENTITY MEMORY_POST_GFS "20GB">
<memory>&MEMORY_INIT_GFS;</memory>
<memory>&MEMORY_POST_GFS;</memory>
3) rerun the jobs that had OOM kills and let me know if they still fail
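As an optional sanity check after resubmitting, you can confirm the memory request actually reached Slurm (standard scontrol query; the job ID is a placeholder):

```shell
# Check the memory limit Slurm recorded for the resubmitted job.
scontrol show job <jobid> | grep -i mem
```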
Thanks! @lisa-bengtsson FYI, this might be related
@KateFriedman-NOAA That worked for my init tasks. I can now run to completion by adding the 100G for memory. For the post tasks, I am still seeing the out of memory error when I added 20G for memory.
@KateFriedman-NOAA If I use 40G for post, it now completes.
@jkhender Awesome, was just gonna have you try more. Your post job is likely processing more forecast hours than mine did so that makes sense.
Do we know what changed after the hera maintenance that we needed to add the memory settings?
Not entirely sure, but this jibes with other recent memory-related issues we saw on Hera/Orion after Slurm updates. We need to specify memory explicitly now; that's on us.
Expected behavior: init task should complete successfully
Current behavior: my init tasks are failing with an 'out of memory' error
Machines affected: hera
To Reproduce: Run the init task from the global workflow. I tested with the top of the develop branch as well as my older workflow used in my realtime runs.
current realtime setup
EXPDIR/XML file: /scratch1/BMC/gsd-fv3/rtruns/UFS-CAMsuite_dev1/FV3GFSwfm/rt_ufscam_l127_dev1/rt_ufscam_l127_icsonly.xml
logfile: /scratch1/BMC/gsd-fv3/rtruns/UFS-CAMsuite_dev1/FV3GFSrun/rt_ufscam_l127_dev1/logs/2022020200/gfsfcst.log
RUNDIR: /scratch1/BMC/gsd-fv3/NCEPDEV/stmp3/rtfim/RUNDIRS/rt_ufscam_l127_dev1/2022020200/gfs/init/

top of develop branch
EXPDIR/xml file: /scratch2/BMC/gsd-fv3-dev/Judy.K.Henderson/test/merge_gw_develop/FV3GFSwfm/jkh/ics.xml
logfile: /scratch2/BMC/gsd-fv3-dev/Judy.K.Henderson/test/merge_gw_develop/FV3GFSrun/jkh/logs/2022020300/gfsinit.log
RUNDIR: /scratch1/BMC/gsd-fv3/NCEPDEV/stmp3/Judy.K.Henderson/RUNDIRS/jkh/2022020300/gfs/init
Context: This error started showing up after the Hera maintenance on 01Feb22. I have not made any changes to my realtime files.
Additional Information: I am using the default configuration in the XML file.
I also tried with 5 nodes, but got the same error. I did not modify config.resources, though.