dmwm / CRAB2


watchdog kills wrong processes with multicore pilots #1141

Closed belforte closed 9 years ago

belforte commented 9 years ago

With multicore pilots, multiple payloads are run by the same pilot and their processes end up in the same process group. This makes the watchdog do two bad things: 1) it uses e.g. the memory consumption of another job to decide to interrupt the current one; 2) when triggered, it kills all processes in the group, even those belonging to other jobs.

This was detected with HammerCloud at FNAL, which runs under the same user as production, so the 2.3 GB RSS limit was triggered by a production job.

More details from the mail exchange are in the comments below.
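
For illustration of point 1): a minimal sketch of what process-group-based memory accounting looks like (generic shell, not the actual crabWatchdog.sh code); with a shared pgrp the sum also includes the other payload's processes:

    #!/bin/bash
    # Sum the RSS (in KiB) of every process in this job's process group.
    # In a multicore pilot without session separation, the group also contains
    # the processes of the other payloads, so their memory is charged to this job.
    PGRP=$(ps -o pgrp= -p $$ | tr -d ' ')
    TOTAL_RSS=$(ps -o rss= --pgrp "$PGRP" | awk '{sum += $1} END {print sum}')
    echo "total RSS in pgrp ${PGRP}: ${TOTAL_RSS} KiB"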

belforte commented 9 years ago

Early diagnostics from Nicolo', with pointers to logs:

Hi Stefano,

On 08/01/2014 06:58 PM, Stefano Belforte wrote:

I had also looked at it, but the only thing I can say is that the job slowly increased in memory usage up to the 2.3 GB mark. This of course makes no sense since those jobs are all identical; they should all use the same (little) memory.

this is one of the successful jobs. Notice that it also had an increasing memory usage, but it completed successfully just before hitting the memory limit:

http://vocms228.cern.ch/test_12167/gangadir/workspace/gangarbt/LocalXML/16/output/res/Watchdog_1.log.gz

On the other hand, this is one of the successful jobs which didn't increase so much in memory:

http://vocms228.cern.ch/test_12167/gangadir/workspace/gangarbt/LocalXML/16/output/res/Watchdog_3.log.gz

The only obvious difference: the failed jobs and the "bad" successful jobs have lines like this in the Watchdog log:

Jul 30 01:54:37 CDT 38484 1641216 2023556 615 89 4669 cmsRun -j FrameworkJobReport.xml PSet.py

Whereas the "good" successful jobs don't.

But I have no idea if this means anything.

Cheers N.

I have no clue to offer. stefano

On 08/01/2014 06:41 PM, Nicolo Magini wrote:

Hi Burt,

(sorry for the delay)

On 07/29/2014 03:07 AM, Burt Holzman wrote:

Thanks Nicolo. Who can help debug these jobs?

some CMSSW or AnalysisOps expert? From the HC support side, we can just point you to the relevant debugging information. E.g. this directory contains the configs, crab reports and job reports of 5 such failed jobs:

http://vocms228.cern.ch/test_12167/gangadir/workspace/gangarbt/LocalXML/16/output/

For example this watchdog log:

http://vocms228.cern.ch/test_12167/gangadir/workspace/gangarbt/LocalXML/16/output/res/Watchdog_2.log.gz

Though at first sight I don't see anything too strange in there.

As usual you can get there by navigating from the HC homepage and looking for FNAL failed jobs in the "Jobs" tab.

I don't think there's anything site-wise for me to do here, and I don't understand why it's only FNAL that sees this (and it looks like only jobs submitted to cmsosgce4?)

It is true that the memory-related failures seem to be concentrated on cmsosgce4

N

- B

On Jul 28, 2014, at 4:18 PM, "Nicolo Magini" <Nicolo.Magini@cern.ch> wrote:

Hi,

Valentina Mancinelli for HC administration, and the Site Support team plus myself for expertise on the tests. Adding everyone in cc. A reminder that you can fetch the CMSSW_x.stdout logs directly from the HC website. E.g. this looks like one of the jobs killed at FNAL for excessive memory usage:

http://vocms207.cern.ch/test_12148/gangadir/workspace/gangarbt/LocalXML/92/output/res/CMSSW_9.stdout

If you need more FNAL logs you can search here, these are the most recent T1 HC test submissions:

http://hammercloud.cern.ch/hc/app/cms/test/12157/ http://hammercloud.cern.ch/hc/app/cms/test/12148/ http://hammercloud.cern.ch/hc/app/cms/test/12138/ http://hammercloud.cern.ch/hc/app/cms/test/12129/

Click on "Jobs" then look for a FNAL failed job and click on its "Sj" (subjob ID).

Cheers Nicolo'

On 28/07/2014 22:12, Stefano Belforte wrote: now that Andrea Sciaba' is away (until 15 Aug), who's following up on HC? Burt H. asked me about some problems at FNAL he reported at the meeting, with HC jobs running over the RSS bound, but I need the job wrapper logs (CMSSW_x.stdout).

stefano

belforte commented 9 years ago

Hint from Burt:

On 8/1/2014 3:23 PM, Stefano Belforte wrote:

last note. If what I guessed is correct, the watchdog kicks in even if the HC job is OK, because there is a production job which takes more than 2.3 GB of RSS, which is very uncommon.

What about a multicore pilot that's using dynamic partitioning and handling single-core jobs? That might explain why this is kicking in for cmsosgce4 only (that may be the only gatekeeper configured at the factory to do multicore pilots?)

belforte commented 9 years ago

Suggestion for a fix from Burt: On 8/1/2014 11:53 PM, Burt Holzman wrote:

On 8/1/2014 3:23 PM, Stefano Belforte wrote:

last note. If what I guessed is correct, the watchdog kicks in even if the HC job is OK, because there is a production job which takes more than 2.3 GB of RSS, which is very uncommon.

What about a multicore pilot that's using dynamic partitioning and handling single-core jobs? That might explain why this is kicking in for cmsosgce4 only (that may be the only gatekeeper configured at the factory to do multicore pilots?)

If this is the cause and the CRAB watchdog checks the pgrp, one solution might be to have an extra script in the beginning that invokes setsid on the main crab driver script?
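
A minimal sketch of such an extra script (hypothetical, not existing CRAB2 code; "CMSSW.sh" is only a stand-in name for the real crab driver script):

    #!/bin/bash
    # setsid_wrapper.sh (hypothetical): start the real crab driver script in a
    # new session, so it gets its own process group and the watchdog's
    # group-wide checks and kills no longer see the pilot's other payloads.
    # Note that setsid(1) forks first if the caller is already a process group
    # leader, so the pid seen by the parent can change.
    exec setsid /bin/bash ./CMSSW.sh "$@"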

belforte commented 9 years ago

my reaction: thanks Burt, indeed the watchdog uses the pgrp https://github.com/dmwm/CRAB2/blob/79eae4fd9beaa651ab25e7be909aa14a6c860deb/python/crabWatchdog.sh#L109 so while this is the first time I hear the word "setsid", your suggestion can surely be implemented. But... realistically... Crab2 is end-of-life now. Do we really want to fix this?

What about:

  1. a temporary kludge (like avoiding some CEs in HC or making it ignore the RSS limit somehow)
  2. Crab3 will not have the problem, since it relies on condor's accounting of used RSS
  3. assume/hope that user jobs will be running in Crab3 before multicore pilots are so widespread that this becomes a larger issue
  4. do the setsid in the pilot anyhow (for better payload separation), sort of as an extra precaution and future proofing.

Or maybe 4. is so easy that it can be done very quickly?

Stefano
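
For illustration of the pgrp-based kill mentioned above: a toy demonstration (generic shell, not the code at crabWatchdog.sh#L109) of how a group-wide signal sent on behalf of one payload also terminates an unrelated one:

    #!/bin/bash
    # Run this from an interactive shell so the script gets its own process group.
    sleep 60 & A=$!      # stand-in for the HC cmsRun
    sleep 60 & B=$!      # stand-in for an unrelated production cmsRun
    PGRP=$(ps -o pgrp= -p "$A" | tr -d ' ')
    echo "payloads $A and $B share pgrp $PGRP"
    # The "watchdog" decides payload A is over its limit and kills by process group:
    kill -TERM -- "-$PGRP"   # takes down A, B, and this script itself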

belforte commented 9 years ago

Burt explains it is not so easy...

So Crab2 is EOL but this is a bug -- when do we drop support completely?

  1. I don't want to avoid sending to one of the three gatekeepers in HC; we want to split problems into thirds if something goes wrong with grid submission at the T1.
  2. Ok.
  3. Should we shut off multicore submission at FNAL at the factory until this is fixed or until CRAB2 is no longer used for user analysis?
  4. The pilot is too early for that -- all the jobs inherit from its pgrp. It might be possible to hack this in the factory or frontend via USER_JOB_WRAPPER, but I think that's pretty hacky.
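
For completeness, a sketch of the USER_JOB_WRAPPER hack mentioned in point 4 (hypothetical file name and deployment, and, as Burt says, hacky; it was not adopted):

    #!/bin/bash
    # user_job_wrapper.sh (hypothetical), pointed to by USER_JOB_WRAPPER in the
    # glidein's condor configuration. HTCondor invokes the wrapper with the
    # payload's command line as its arguments; exec'ing it under setsid puts
    # each payload in its own session and process group.
    exec setsid "$@"
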
belforte commented 9 years ago

sort of "final" understanding

HC can be fixed via the configuration file to ignore the RSS limit. So if you want to do 3) right now for a few days until HC is patched, fine.

That done, we are left with the possibility that two jobs from the same user run in the same multicore pilot and the one going out of bounds kills both. Ugly, but maybe we can live with it? I am not against fixing bugs in Crab2, but I am fearful of introducing new ones.

As permanent solution (4.) I have opened a Git Issue https://github.com/dmwm/CRAB2/issues/1141

But adding another wrapper around the crab wrapper does not strike me as very good. I still think it is the multicore pilot's duty to isolate payloads. Looking at the process tree [*], there is one condor_startd which forked three condor_starter processes. Of those starters one appears idle, one runs Crab (HC), one runs Production. It seems to me that those condor_starter processes should be in separate groups, so that the processes inside each are better separated. Of course two payloads run under the same user id are never "well separated".

[*]
 4832 ?  SN   0:00  _ /storage/local/data1/condor/execute/dir_58048/glide_GIpI8g/main/condor/sbin/condor_master -f -pidfile /storage/local/data1/condor/execute/dir_58048/glide_GIpI8g/condor_master2.pid
 4844 ?  SN   0:30    condor_procd -A /storage/local/data1/condor/execute/dir_58048/glide_GIpI8g/log/procd_address -L /storage/local/data1/condor/execute/dir_58048/glide_GIpI8g/log/ProcLog -R 10000000 -S 60 -C 9811
 4864 ?  SN   0:12    condor_startd -f
 9827 ?  SN   0:01      condor_starter -f -a slot11 vocms228.cern.ch
 9858 ?  SN   0:01      condor_starter -f -a slot14 vocms228.cern.ch
10871 ?  SN   0:00      | /bin/sh /storage/local/data1/condor/execute/dir_58048/glide_GIpI8g/execute/dir_9858/condor_exec.exe 2 1
13605 ?  SN   0:06      | /bin/bash /storage/local/data1/condor/execute/dir_58048/glide_GIpI8g/execute/dir_9858/crabWatchdog.sh
20259 ?  SN   0:00      | /usr/bin/time -f %U %S %P -o cpu_timing.txt cmsRun -j /storage/local/data1/condor/execute/dir_58048/glide_GIpI8g/execute/dir_9858/crab_fjr2.xml -p pset.py
20260 ?  RNl 83:41      | cmsRun -j /storage/local/data1/condor/execute/dir_58048/glide_GIpI8g/execute/dir_9858/crab_fjr2.xml -p pset.py
31915 ?  SN   0:00      condor_starter -f -a slot12 vocms142.cern.ch
31967 ?  SN   0:00      /bin/bash /storage/local/data1/condor/execute/dir_58048/glide_GIpI8g/execute/dir_31915/condor_exec.exe franzoni_RVCMSSW_7_1_2TTJets_MassiveBinDECAY_PU50ns__dvmc_140724_1441301513-Sandbox.tar.bz2 324711
32030 ?  SNl  0:01      python2.6 Startup.py
38270 ?  SN   0:00      _ /bin/bash /storage/local/data1/condor/execute/dir_58048/glide_GIpI8g/execute/dir_31915/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh slc6_amd64_gcc481 scramv1 CMSSW CMSSW_7_1_2 FrameworkJobReport.xml cmsRun PSet.py
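
A generic way to check this grouping directly on a worker node (standard ps options, not a command from the thread); if the payloads share a process group as described above, the PGID/SID columns show the overlap:

    ps -eo pid,ppid,pgid,sid,args | grep -E 'condor_start|crabWatchdog|cmsRun' | grep -v grep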

belforte commented 9 years ago

This is almost only relevant for HC, which runs under the same user as production. Normal users will only be affected if they run two different tasks, land on the same multicore pilot, and only one of the tasks has memory problems. For HC there is a fix via additional_input_file=rssLimit to disable the RSS check. Also, Burt said they may be able to separate process groups in gWMS. Besides, Crab3 is coming. So let's try to live with the bug.
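
For context on the Crab3 point: HTCondor does its own per-job memory accounting, exposed through job ClassAd attributes such as MemoryUsage and ResidentSetSize, which can be inspected with condor_q (standard HTCondor usage, assuming a condor_q recent enough to support -autoformat; not a command from this thread):

    # Per-job memory as accounted by condor for the currently running jobs:
    condor_q -constraint 'JobStatus == 2' -af:h ClusterId ProcId Owner MemoryUsage ResidentSetSize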