NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

out of memory error with init task #631

Closed · jkhender closed 2 years ago

jkhender commented 2 years ago

Expected behavior: The init task should complete successfully.

Current behavior: My init tasks are failing with an 'out of memory' error.

Machines affected: Hera

To Reproduce: Run the init task from the global workflow. I tested with the top of the develop branch as well as with my older workflow used in my realtime runs.

Current realtime setup:
EXPDIR/XML file: /scratch1/BMC/gsd-fv3/rtruns/UFS-CAMsuite_dev1/FV3GFSwfm/rt_ufscam_l127_dev1/rt_ufscam_l127_icsonly.xml
logfile: /scratch1/BMC/gsd-fv3/rtruns/UFS-CAMsuite_dev1/FV3GFSrun/rt_ufscam_l127_dev1/logs/2022020200/gfsfcst.log
RUNDIR: /scratch1/BMC/gsd-fv3/NCEPDEV/stmp3/rtfim/RUNDIRS/rt_ufscam_l127_dev1/2022020200/gfs/init/

Top of develop branch:
EXPDIR/XML file: /scratch2/BMC/gsd-fv3-dev/Judy.K.Henderson/test/merge_gw_develop/FV3GFSwfm/jkh/ics.xml
logfile: /scratch2/BMC/gsd-fv3-dev/Judy.K.Henderson/test/merge_gw_develop/FV3GFSrun/jkh/logs/2022020300/gfsinit.log
RUNDIR: /scratch1/BMC/gsd-fv3/NCEPDEV/stmp3/Judy.K.Henderson/RUNDIRS/jkh/2022020300/gfs/init

Context: This error started showing up after the Hera maintenance on 01Feb22. I have not made any changes to my realtime files.

Additional Information: I am using the default configuration in the XML file:

<!ENTITY RESOURCES_INIT_GFS "<nodes>4:ppn=6:tpp=1</nodes>">

I also tried with 5 nodes, but got the same error. I did not modify config.resources, though.

KateFriedman-NOAA commented 2 years ago

@GeorgeGayno-NOAA Likely need your inputs on adjusting resources for the gfsinit job, which runs chgres_cube via UFS_UTILS. Thanks!

GeorgeGayno-NOAA commented 2 years ago

> @GeorgeGayno-NOAA Likely need your inputs on adjusting resources for the gfsinit job, which runs chgres_cube via UFS_UTILS. Thanks!

I can take a look.

GeorgeGayno-NOAA commented 2 years ago

@KateFriedman-NOAA and @jkhender - I copied Judy's fort.41 file to my own space. I compiled 'develop', then kicked off chgres as a stand-alone job (not as part of the workflow). It worked for me with 4 nodes/6 tasks per node. Go to /scratch1/NCEPDEV/da/George.Gayno/noscrub/judy to see the script I used and log file. So I am not sure why it is not working for you. Kate, I will need to learn how the workflow runs chgres.
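
For reference, a rough sketch of what such a stand-alone submission could look like (the account, walltime, and paths below are placeholders, not copied from George's script):

#!/bin/bash
#SBATCH --job-name=chgres_standalone    # hypothetical stand-alone test job
#SBATCH --nodes=4                       # same layout as the workflow default
#SBATCH --ntasks-per-node=6             # 4 nodes x 6 tasks per node
#SBATCH --time=00:30:00                 # placeholder walltime
#SBATCH --account=YOUR_ACCOUNT          # placeholder account

set -x
cd /path/to/rundir                      # directory holding fort.41 and the input data
srun /path/to/exec/chgres_cube          # executable path is illustrative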

WalterKolczynski-NOAA commented 2 years ago

I didn't see any error when I tried last night either. The only thing I could think of is that @jkhender is targeting a different resolution that requires more memory to do the interpolation.

jkhender commented 2 years ago

@GeorgeGayno-NOAA @KateFriedman-NOAA - I'm running C768. I just copied what George did in my own directory and it ran fine for me standalone.

/scratch1/BMC/gsd-fv3-dev/Judy.K.Henderson/scratch/tmp/chgres_gg

jkhender commented 2 years ago

@GeorgeGayno-NOAA @KateFriedman-NOAA I also ran another standalone pointing to my realtime executable and files and it also completed successfully. /scratch1/BMC/gsd-fv3-dev/Judy.K.Henderson/scratch/tmp/chgres_gg/ufsdev1

arunchawla-NOAA commented 2 years ago

Is this still an issue, @jkhender? Is the standalone job working but not as part of the workflow?

lisa-bengtsson commented 2 years ago

@jkhender did you manage to solve this problem? I'm getting:

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=28231120.0. Some of your processes may have been killed by the cgroup out-of-memory handler

in the post processing. I wonder if it is related?

jkhender commented 2 years ago

Yes, this is still an issue when running the init task with the global workflow. I can run the standalone job fine. My forecast and post jobs also complete successfully with the workflow.

lisa-bengtsson commented 2 years ago

Ok, thanks for getting back to me

jkhender commented 2 years ago

I don't know if this is related or not. When comparing my before/after maintenance log files, I see this difference under the ulimit -a output:

before maintenance

+ulimit -a
...
nice (-e) 0
nofile (-n) 131072
nproc (-u) 380571

after maintenance

ulimit -a
...
nice (-e) 0
nofile (-n) 4096
nproc (-u) 380571

before: /scratch1/BMC/gsd-fv3/rtruns/UFS-CAMsuite_dev1/FV3GFSrun/rt_ufscam_l127_dev1/logs/2022020100/gfsinit.log
after: /scratch1/BMC/gsd-fv3/rtruns/UFS-CAMsuite_dev1/FV3GFSrun/rt_ufscam_l127_dev1/logs/2022020300/gfsinit.log

I checked Lisa's log files and see the same thing.
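
A quick way to pull those limits out of the two logs side by side (a simple sketch using the log paths above; the grep context length is arbitrary):

grep -A20 'ulimit -a' /scratch1/BMC/gsd-fv3/rtruns/UFS-CAMsuite_dev1/FV3GFSrun/rt_ufscam_l127_dev1/logs/2022020100/gfsinit.log > limits_before.txt
grep -A20 'ulimit -a' /scratch1/BMC/gsd-fv3/rtruns/UFS-CAMsuite_dev1/FV3GFSrun/rt_ufscam_l127_dev1/logs/2022020300/gfsinit.log > limits_after.txt
diff limits_before.txt limits_after.txt    # the nofile limit drops from 131072 to 4096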

KateFriedman-NOAA commented 2 years ago

@jkhender Good catch! I know I had to add the following to my .bashrc to support running the UPP on Hera:

#For UPP
ulimit -S -s unlimited

@RaghuReddy-NOAA Was there a ulimit-related change on Hera during the Tuesday maintenance? Thanks!

KateFriedman-NOAA commented 2 years ago

Note, I haven't tried running anything on Hera this week so I can't confirm if that ulimit setting I have in my .bashrc resolves this OOM kill issue or not. I've had that in my .bashrc for a long time.

jkhender commented 2 years ago

In env/HERA.env, the unlimited setting is in there.

ulimit -s unlimited
ulimit -a

KateFriedman-NOAA commented 2 years ago

> In env/HERA.env, the unlimited setting is in there

Right! Forgot I put that in there. Thanks @jkhender for checking!

lisa-bengtsson commented 2 years ago

I have this in my log that failed:

grep ulimit gfspost001.log

And this in my log that ran: grep ulimit gfspost001.log.5

jkhender commented 2 years ago

@lisa-bengtsson I had checked logfiles gfspost002.log.0 and gfspost002.log.1 in your directory:

/scratch2/BMC/rem/Lisa.Bengtsson/ufs_gfsv16_prog_closure/comrot/long_progsigma/logs/2019101900

-rw-r--r-- 1 Lisa.Bengtsson rem 1204572 Feb  3 17:04 gfspost002.log.0
-rw-r--r-- 1 Lisa.Bengtsson rem 3233674 Jan 31 19:42 gfspost002.log.1

lisa-bengtsson commented 2 years ago

@jkhender I recompiled and ran, so both log.0 and log.1 are failing with the memory problem. From what I can see, the last one that worked is gfspost001.log.5. In that one I do not see execution of HERA.env, so perhaps something else changed?

lisa-bengtsson commented 2 years ago

gfspost001.log.2 seems to have finished fine as well, before the update.

rreddy2001 commented 2 years ago

We are looking at a few things. Please submit a Hera helpdesk ticket if you have not already done so.

rreddy2001 commented 2 years ago

> In env/HERA.env, the unlimited setting is in there
>
> Right! Forgot I put that in there. Thanks @jkhender for checking!

Hi Kate,

If you are able to run the job outside of Rocoto successfully, we would like to see how much memory a successful run uses. Can you please add the following two commands to the end of your job file:

printenv
report-mem

It will be helpful to get the same info for the run under Rocoto too please.

For the run that fails under Rocoto, it may not get to the end of the job file, so the --epilog approach documented here may provide some additional information:

https://rdhpcs-common-docs.rdhpcs.noaa.gov/wiki/index.php/Running_and_Monitoring_Jobs_on_Jet_and_Hera(Theia)_-_SLURM#How_to_Get_Memory_Usage_Information

For the epilog approach, I guess you would have to modify your APRUN (or is it the LAUNCHER?) environment variable (I am not very familiar with the config files, but I know you have one that is typically set to "srun").
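
A hedged sketch of both suggestions (report-mem is the site utility named above; the epilog path and the APRUN/LAUNCHER variable are approximations, not verified against the workflow config files):

# 1) For the stand-alone job that completes: append these to the job script
printenv        # record the job environment
report-mem      # print the memory usage of the job (site utility)

# 2) For the Rocoto job, which may be killed before reaching the end of the
#    script, attach the report to srun itself via --epilog (script path is
#    illustrative; see the wiki page linked above):
export APRUN="srun --epilog=/path/to/report-mem"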

jkhender commented 2 years ago

I added the memory information to the jobs I am running standalone. Logs are at

/scratch1/BMC/gsd-fv3-dev/Judy.K.Henderson/scratch/tmp/chgres_gg/fv3chem/log_0207
/scratch1/BMC/gsd-fv3-dev/Judy.K.Henderson/scratch/tmp/chgres_gg/ufsdev1/log_feb07

I will add the epilog info to the workflow tomorrow and run a test.

jkhender commented 2 years ago

I added the epilog parameter to my srun command in my global workflow and the log file is at

/scratch1/BMC/gsd-fv3/rtruns/UFS-CAMsuite_dev1/FV3GFSrun/rt_ufscam_l127_dev1/logs/2022020700/gfsinit.log

rreddy2001 commented 2 years ago

@jkhender thank you for those runs! I just wanted to make sure those jobs were not close to the limit, and it looks like there is plenty of room for the successful runs.

KateFriedman-NOAA commented 2 years ago

@jkhender Please try the following:

1) add these two settings to your config.resources:

to the init block:

export memory_init="100GB"

to the post block:

export memory_post="20GB"

2) rerun setup_workflow*.py to update the XML; you should now have these new MEMORY settings for the init and post jobs in your XML (a quick check is sketched after step 3):

<!ENTITY MEMORY_INIT_GFS    "100GB">
<!ENTITY MEMORY_POST_GFS    "20GB">
<memory>&MEMORY_INIT_GFS;</memory>
<memory>&MEMORY_POST_GFS;</memory>

3) rerun the jobs that had OOM kills and let me know if they still fail
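
A quick way to check step 2 took effect (a sketch; the setup script location and flag follow the setup_workflow*.py convention above and may differ in your checkout):

cd /path/to/global-workflow/ush/rocoto      # location of the setup scripts (may vary by branch)
./setup_workflow.py --expdir $EXPDIR        # regenerate the Rocoto XML
grep MEMORY $EXPDIR/*.xml                   # the new MEMORY_INIT_GFS / MEMORY_POST_GFS entities should appear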

Thanks! @lisa-bengtsson FYI, this might be related

jkhender commented 2 years ago

@KateFriedman-NOAA That worked for my init tasks. I can now run to completion after adding the 100G of memory. For the post tasks, I am still seeing the out-of-memory error after adding 20G of memory.

jkhender commented 2 years ago

@KateFriedman-NOAA If I use 40G for post, it now completes.

KateFriedman-NOAA commented 2 years ago

> @KateFriedman-NOAA If I use 40G for post, it now completes.

@jkhender Awesome, I was just going to have you try more memory. Your post job is likely processing more forecast hours than mine did, so that makes sense.

jkhender commented 2 years ago

Do we know what changed after the Hera maintenance such that we now need to add the memory settings?

KateFriedman-NOAA commented 2 years ago

Not entirely sure, but this jibes with other recent memory-related issues we saw on Hera/Orion after SLURM updates. We need to specify memory requests explicitly now; that's on us.
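
For anyone hitting this later, one way to see how much memory a finished job actually requested and used is a generic Slurm accounting query (replace the job ID; ReqMem and MaxRSS are standard sacct fields):

sacct -j <jobid> --format=JobID,JobName,ReqMem,MaxRSS,Elapsed,State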