dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Dashboard Reporting from WN #1692

Closed by sfoulkes 11 years ago

sfoulkes commented 13 years ago

This allegedly isn't working (the reporting of the JobExitCode). We should take a look and see if there are any obvious problems. Also:

"Is it possible to send ExeExitCode at the end of every executable? For the moment we get ExeStart and ExeEnd with the name of the executable and the time stamp, which comes with ML for free, but there is no exit code of the executable. If we want at some point to restore the whole execution chain of the job, it would be nice to have the exit code of every execution step."
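To make the request concrete, here is a minimal sketch of what per-executable reporting could look like. Only the parameter names `ExeStart`, `ExeEnd` and `ExeExitCode` come from the request above; `run_and_report`, the `send` callback and the step names are hypothetical and are not the actual WMCore or MonALISA API.

```python
def run_and_report(steps, send):
    """Run each (name, callable) step and report its exit code via send().

    `send` stands in for whatever actually ships the message to the
    Dashboard; it is a hypothetical callback, not a real WMCore call.
    """
    for name, step in steps:
        send({"ExeStart": name})
        exit_code = step()
        # Send the exit code together with ExeEnd so the whole
        # execution chain can be reconstructed later.
        send({"ExeEnd": name, "ExeExitCode": int(exit_code)})


# Usage: collect the messages in a list instead of sending them anywhere.
sent = []
run_and_report([("step1", lambda: 0), ("step2", lambda: 1)], sent.append)
```

With this shape, every `ExeEnd` message carries the exit code of the step that just finished, so a failing step in the middle of the chain is visible on its own.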

sfoulkes commented 13 years ago

sfoulkes: Also, we're evidently not increasing the resubmission count:

"If the job got resubmitted, the resubmission attempt number should be increased; it should not be 0 all the time. This also causes problems, though it is less urgent and serious than the lack of a JobExitCode report from the worker node."
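A minimal sketch of the behaviour being asked for, assuming the attempt number starts at 0 and goes up by one on each resubmission. `SubmissionCounter` and `next_attempt` are hypothetical names for illustration, not existing WMCore classes.

```python
class SubmissionCounter:
    """Track how many times each job has been submitted (hypothetical helper)."""

    def __init__(self):
        self._attempts = {}

    def next_attempt(self, job_name):
        """Return 0 for the first submission, 1 for the first resubmission, and so on."""
        attempt = self._attempts.get(job_name, -1) + 1
        self._attempts[job_name] = attempt
        return attempt


counter = SubmissionCounter()
first = counter.next_attempt("job-A")   # first submission
second = counter.next_attempt("job-A")  # resubmission of the same job
```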

sfoulkes commented 13 years ago

sfoulkes: Also they're not getting performance information:

For some reason jobs stopped reporting JobExitCode from the WN. They do not report ExeEnd or ExeExitCode at the end of every executable. They do not report CPU consumption, etc.

The following Dashboard link shows that only one instance of the WMAgent has all jobs with a resolved exit status (success or not): t1processing@cms-xen39.fnal.gov http://dashb-cms-job.cern.ch/dashboard/request.py/jobsummary#user=&site=&ce=&submissiontool=wmagent&dataset&application=&rb=&activity=reprocessing&grid=&date1=2011-05-26%2009%3A28%3A00&date2=2011-05-27%2009%3A28%3A00&sortby=submissionui&nbars=&scale=linear&jobtype=&tier=&check=terminated

All others have plenty of light green jobs, which means that those jobs were timed out by Dashboard after being in running status for 24 hours. This results in completely screwed up statistics for the number of running jobs: yesterday Dashboard counted ~20K running WMAgent jobs, when in practice there were fewer than 2K running jobs according to WMAgent monitoring.
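The effect described above can be sketched as follows: a job that never reports an exit code keeps counting as running until the 24-hour timeout fires. The 24-hour figure is from the report; the job record layout and the `classify` function are assumptions for illustration, not the Dashboard's actual implementation.

```python
from datetime import datetime, timedelta

RUNNING_TIMEOUT = timedelta(hours=24)  # timeout stated in the report above


def classify(job, now):
    """Classify a job record (hypothetical layout: 'started', 'exit_code')."""
    if job["exit_code"] is not None:
        return "terminated"
    if now - job["started"] >= RUNNING_TIMEOUT:
        return "timed_out"
    # No exit code yet and not old enough to time out: still counted as running,
    # even if the job actually finished hours ago without reporting.
    return "running"


now = datetime(2011, 5, 27, 9, 28)
jobs = [
    {"started": datetime(2011, 5, 27, 8, 0), "exit_code": 0},
    {"started": datetime(2011, 5, 27, 8, 0), "exit_code": None},
    {"started": datetime(2011, 5, 26, 8, 0), "exit_code": None},
]
states = [classify(job, now) for job in jobs]
```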

Even jobs submitted from t1processing@cms-xen39.fnal.gov did not report JobExitCode from the WN, but they reported it from the server. For other WMAgent jobs we did not get reports that the jobs finished, neither from the server nor from the WNs. It also looks to me that reporting from the server is very much delayed and incomplete.

Just one example: for the job 1152b8d6-86bd-11e0-9215-003048f1c5d0_0 we got a report from the WN that the executable cleanupUnmergedALCARECOStreamHcalCalHOCosmics started at 12:54:00 and finished at 12:54:12 on the 25th of May, but no JobExitCode was sent. Then at 23:10:12 we got a report from the server that the job was submitted. No other info ever arrived for this job, and there are tons of jobs like this one.

Would it be possible to find out why JobExitCode is no longer reported from the WNs for ALL WMAgent instances? And why, for almost all of them, we did not get the status of the jobs from the server either? Essentially we get only a very delayed submission report and nothing else after it.

DMWMBot commented 13 years ago

mnorman: Had a bool to int problem.

Shouldn't matter, since the JobExitCode sent from the WN needs to be overwritten by the one sent from the DashboardReporter, but it should at least not fail now.
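For readers who have not seen the commit, here is a minimal sketch of how a bool/int mix-up around an exit code can arise in Python. This is not the actual fix from the commit; `normalise_exit_code` is a hypothetical helper, and the convention that `True` means success is an assumption made for the example.

```python
def normalise_exit_code(value):
    """Turn a step result into an integer exit code.

    In Python, bool is a subclass of int, so int(True) == 1. If True is
    meant as "success", a naive int() turns success into exit code 1.
    Assumption for this sketch: True means success, False means failure.
    """
    if isinstance(value, bool):
        return 0 if value else 1
    return int(value)


naive = int(True)                    # 1, which reads as a failure code
fixed = normalise_exit_code(True)    # 0, i.e. success under the assumption above
```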

sfoulkes commented 13 years ago

sfoulkes: (In 0c2720dd2306c76bc4737a7bd3ae94955d2069f2) Fix reporting of the JobExitCode from the WN. Fixes #1692.

From: Matt Norman mnorman@fnal.gov Signed-off-by: Steve Foulkes sfoulkes@fnal.gov

sfoulkes commented 13 years ago

sfoulkes: (In 12976) Fix reporting of the JobExitCode from the WN. Fixes #1692.

From: Matt Norman mnorman@fnal.gov Signed-off-by: Steve Foulkes sfoulkes@fnal.gov