cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Mark `cmsRun` as a likely candidate for the kernel OOM killer #45855

Open fwyzard opened 2 weeks ago

fwyzard commented 2 weeks ago

On a worker node without hard memory limits, cmsRun may occasionally cause an out-of-memory (OOM) situation that leads to a system process being killed by the kernel.

The kernel can be "encouraged" to kill a cmsRun process instead of some other process by setting /proc/PID/oom_score_adj to a value larger than 0, up to 1000 (see man oom_score_adj).
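To see what the kernel currently thinks of a given process, one can read both the adjustment and the resulting score from procfs. The snippet below is only a sketch: the helper name `read_oom_values` and its `proc_root` parameter are made up for illustration (the latter just makes it easy to point the function at a fake directory when testing).

```python
from pathlib import Path


def read_oom_values(pid="self", proc_root="/proc"):
    """Return the OOM score and adjustment of a process as a dict.

    Hypothetical helper: `pid` may be a number or the string "self";
    `proc_root` is only a parameter so that the function can be tested
    against a directory other than the real /proc.
    """
    base = Path(proc_root) / str(pid)
    values = {}
    for name in ("oom_score", "oom_score_adj"):
        # Each file holds a single integer followed by a newline.
        values[name] = int((base / name).read_text())
    return values
```

Calling `read_oom_values()` on a Linux machine returns the values for the calling process; passing a PID instead of `"self"` inspects another process, e.g. a running cmsRun job.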

As this needs to be set for each process, would it make sense to let cmsRun set the value itself when it starts, by writing to /proc/self/oom_score_adj?
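For illustration, the write itself is tiny. Below is a minimal Python sketch of the idea; cmsRun would do the equivalent in C++ at startup. The function name `set_oom_score_adj` and its `path` parameter are hypothetical, and the choice to return `False` instead of failing is only one possible policy for systems where the file is missing or not writable. An unprivileged process can raise its own value, so no special permissions should be needed for positive adjustments.

```python
from pathlib import Path


def set_oom_score_adj(value, path="/proc/self/oom_score_adj"):
    """Try to set the OOM score adjustment of the current process.

    Hypothetical helper: returns True on success and False if the file
    cannot be written (e.g. not on Linux, or procfs not available).
    `path` is a parameter only to allow testing against a regular file.
    """
    # The kernel accepts values from -1000 to 1000.
    if not -1000 <= value <= 1000:
        raise ValueError(f"oom_score_adj must be in [-1000, 1000], got {value}")
    try:
        Path(path).write_text(f"{value}\n")
    except OSError:
        # Treat a failed write as non-fatal: the job can still run,
        # it just keeps the default OOM priority.
        return False
    return True
```

A job would call `set_oom_score_adj(100)` once, early on; child processes inherit the value across fork.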

We could use something like process.options.oomScoreAdjust to make it configurable, and start with a default value between 100 (somewhat more likely) and 500 (much more likely).

fwyzard commented 2 weeks ago

assign core

cmsbuild commented 2 weeks ago

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 2 weeks ago

A new Issue was created by @fwyzard.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
