dmwm / CRAB2


in some cases crab wrapper reports CpuTime=-1 #901

Closed ericvaandering closed 10 years ago

ericvaandering commented 10 years ago

Original Savannah ticket 101295 reported by belforte on Tue Apr 23 06:31:32 2013.

pfff... this seems to be a problem in the crab code that computes CPU time. I spent some time digging for https://vocms83.cern.ch//130418//3764.25

But it seems that it was resubmitted or something like that, because from condor_history that would be job 84 in this task: http://dashb-cms-job-task.cern.ch/dashboard/request.py/taskmonitoring#action=taskJobs&usergridname=PaoloDini&taskmonid=dini_ReReco_2011B_DoubleMu4-PtIso-dR03-PiPt05_46o5gh&what=all

But when I look at the WN log for that job, it completed OK in 8k seconds and the condorId is different: MonitorJobID=84_https://vocms83.cern.ch//130422//4666.2 MonitorID=dini_ReReco_2011B_DoubleMu4-PtIso-dR03-PiPt05_46o5gh

OK, looking instead at job 82 in that task, I see that CPU time was reported as -1 to dashboard. that'd be http://dashb-cms-job-task.cern.ch/dashboard/request.py/taskmonitoring#action=resubmittedjobs&usergridname=PaoloDini&timerange=lastWeek&what=ALL&taskjobid=621181961&taskmonid=dini_ReReco_2011B_DoubleMu4-PtIso-dR03-PiPt05_46o5gh

I guess I should make sure -1 is never reported. OTOH I have no idea why ps shows a huge value for CPU time [*]. It seems like in this case I am getting the value for the pilot, possibly including earlier jobs; maybe glexec did not do the right thing? I should not see the startd in the process tree.

Bottom line: I am not sure we want dashboard to ignore those odd values, since they point to actual problems for us to fix, however painful that may be.

Stefano

[*]

```
TIME                 PID    RSS(KB)  VSZ(KB)  Dsk(MB)  tCPU(s)    tWALL(s)  COMMAND
Apr 18 14:39:30 EDT  16086  1756     14224    0        144093282  0         -sh
Apr 18 14:39:30 EDT  46634  8204     101188   0        488963602  0         /tmp/glide_TGAYW7/main/condor/sbin/condor_master -f -pidfile /tmp/glide_TGAYW7/condor_master2.pid
Apr 18 14:39:30 EDT  47298  8900     101652   0        918594676  0         condor_startd -f
Apr 18 14:39:30 EDT  ---    8900     101652   4        ----       ----      ----
```
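The huge tCPU values in the dump belong to pilot daemons (condor_master, condor_startd), whose cumulative CPU time predates the payload job. A minimal sketch of the kind of filtering a wrapper could apply when summing CPU time from such a dump (a hypothetical helper, not the actual CRAB2 code; the column layout is assumed from the dump above):

```python
# Processes whose executable name marks them as part of the pilot,
# not the payload: their CPU time must not be charged to this job.
PILOT_PREFIXES = ("condor_", "glidein")

def payload_cpu_seconds(monitor_lines):
    """Sum the tCPU(s) column over payload processes only.

    Each line is assumed to follow the watchdog dump format:
    'Mon DD HH:MM:SS TZ  PID  RSS  VSZ  Dsk  tCPU  tWALL  COMMAND...'
    Header and '----' separator lines are skipped automatically.
    """
    total = 0
    for line in monitor_lines:
        fields = line.split(None, 10)
        # Skip headers and separator lines (tCPU column not a number).
        if len(fields) < 11 or not fields[8].isdigit():
            continue
        command = fields[10]
        executable = command.split("/")[-1].split()[0]
        if executable.startswith(PILOT_PREFIXES) or "condor" in command:
            continue  # pilot daemon, not payload
        total += int(fields[8])
    return total
```

With this filter the two Condor daemons in the dump above would contribute nothing, and only the payload shell's CPU time would be reported.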


stefano

On 04/23/2013 10:51 AM, Julia Andreeva wrote:

> Hi James,
>
> The problem is due to some strange value reported from several jobs of user Paolo Dini. These jobs all finished with exit code 50663; they reported CPU usage of the order of 50K - 220K days though they ran for a few minutes.
>
> Examples of the job ID as it was reported to dashboard are:
> https://vocms83.cern.ch//130418//3764.25
> https://vocms83.cern.ch//130418//3764.28
>
> We have a protection at the level of the historical views, where we ignore such strange reports, but we do not check anything at the level of recording values from the collector to the DB. So in the interactive view you see what we got. Would you like us to change an API to add a check that CPU <= wallclock time?
>
> Thank you
>
> Cheers
>
> Julia
>
> On Mon, 22 Apr 2013, letts wrote:
>
>> Thanks, Julia.
>>
>> In the meantime if I learn anything new I will tell you.
>>
>> James
>>
>> On Apr 22, 2013, at 1:20 PM, Julia Andreeva wrote:
>>
>>> Hi James,
>>> I'll check tomorrow.
>>>
>>> Cheers
>>>
>>> Julia
>>> ____
>>> From: letts [jletts@ucsd.edu]
>>> Sent: 22 April 2013 22:10
>>> To: Julia Andreeva
>>> Cc: cms-crab-operations (daily Crab operation issues and discussion)
>>> Subject: inflated CPU time numbers from Purdue last week
>>>
>>> Hi Julia,
>>>
>>> Do you have any idea why the CPU time reported from Purdue for analysis jobs last week is so crazy? I'm trying to figure out if there is a problem on our end, or some kind of weird site/user problem?
>>>
>>> http://dashb-cms-job.cern.ch/dashboard/templates/web-job2/#user=&refresh=0&table=Jobs&p=1&records=25&sorting=20&sorting=desc&activemenu=0&usr=&site=&submissiontool=&application=&activity=analysis&status=&check=submitted&tier=&date1=2013-04-15+19%3A54&date2=2013-04-22+19%3A54&sortby=site&scale=linear&bars=20&ce=&rb=&grid=&jobtype=&submissionui=&dataset=&submissiontype=&task=&subtoolver=&genactivity=&outputse=&appexitcode=&accesstype=
>>>
>>> Regards,
>>>
>>> James
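The check Julia proposes for the recording API — rejecting reports where CPU time is negative or exceeds wall-clock time — could be sketched as follows (hypothetical function and names, not the actual dashboard code; a multi-core job would need the wall-clock budget scaled by core count, assumed here via a `cores` parameter):

```python
def is_sane_cpu_report(cpu_seconds, wall_seconds, cores=1):
    """Reject obviously bogus CPU-time reports before they reach the DB:
    negative values (such as the -1 sentinel) and values exceeding the
    wall-clock budget (wall time times core count)."""
    return 0 <= cpu_seconds <= wall_seconds * cores
```

Applied at recording time, this would have filtered both the -1 sentinels and the ~500M-second pilot values before they reached the interactive view.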

ericvaandering commented 10 years ago

Comment by belforte on Tue Apr 23 06:33:18 2013

This is likely due to this initialization in ScriptWriter.py: `txt += 'CPU_INFOS=-1 \n'`

combined with the fact that, for some reason, ps was returning huge values and the watchdog killed the job before the counter could be updated with a sensible value.

So I will initialize it to zero instead.
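Initializing to zero, optionally combined with clamping at report time, guarantees the dashboard never sees the sentinel even if the watchdog kills the job before the first update. A sketch of the clamping side (a hypothetical helper, not the committed ScriptWriter.py change, which only flips the initial value from -1 to 0):

```python
def reported_cpu_seconds(cpu_infos):
    """Clamp the wrapper's CPU counter so the -1 sentinel (or any other
    negative value) is reported to the dashboard as 0 rather than passed
    through verbatim."""
    return max(0, int(cpu_infos))
```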

ericvaandering commented 10 years ago

Comment by belforte on Tue Apr 23 06:38:34 2013

committed:

```
/local/reps/CMSSW/COMP/CRAB/python/ScriptWriter.py,v  <--  ScriptWriter.py
new revision: 1.52; previous revision: 1.51
```

ericvaandering commented 10 years ago

Closed by belforte on Fri May 3 14:06:24 2013

ericvaandering commented 10 years ago

Comment by belforte on Fri May 3 14:06:24 2013

Released in client CRAB_2_8_7.