dmwm / CRABServer

Report exit code in case of unexpected errors in postjob #4613

Closed mmascher closed 9 years ago

mmascher commented 9 years ago

From: https://hypernews.cern.ch/HyperNews/CMS/get/crabDevelopment/2254.html

In general this should be considered an "internal error in the postjob script": the code ends up in this fallback "except" clause, https://github.com/dmwm/CRABServer/blob/master/src/python/TaskWorker/Actions/PostJob.py#L1065-L1069, and the postjob should be retried. We should probably record the error in the fjr (framework job report) so that crab status shows it after the last retry, and probably create an exit code for postjob failures.

That said, afaik glidemon reports the job exit code taken from condor, so I believe it is really difficult to get these exit codes there. The dashboard is a different story, since we can send a packet there.
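A minimal sketch of the idea above, for illustration only: wrap the postjob logic in a fallback handler that writes the unexpected error into a JSON report and returns a dedicated non-zero code, so the caller can treat the attempt as failed. The names `run_postjob`, `POSTJOB_INTERNAL_ERROR`, the `"postjob"` report key and the value of the exit code are all hypothetical; none of them are taken from PostJob.py or from the real fjr schema.

```python
import json
import traceback

# Hypothetical placeholder value, not an official CRAB exit code.
POSTJOB_INTERNAL_ERROR = 1


def run_postjob(execute, report_path):
    """Run `execute()`; on an unexpected exception, record it in the
    JSON report at `report_path` and return a non-zero exit code."""
    try:
        return execute()
    except Exception as exc:
        record = {
            "exitCode": POSTJOB_INTERNAL_ERROR,
            "exitMsg": "Internal error in the postjob script: %s" % exc,
            "traceback": traceback.format_exc(),
        }
        # Reuse the existing report if it is readable, otherwise start fresh.
        try:
            with open(report_path) as fd:
                report = json.load(fd)
        except (IOError, ValueError):
            report = {}
        # Hypothetical key; the real report layout may differ.
        report["postjob"] = record
        with open(report_path, "w") as fd:
            json.dump(report, fd)
        return POSTJOB_INTERNAL_ERROR
```

With something like this in place, a status command could read the recorded message from the report instead of relying only on the job wrapper's exit code.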

mmascher commented 9 years ago

We should also send a packet to the dashboard in this case.

belforte commented 9 years ago

But was the problem really something that went wrong in ASO land, or only a failure to look up the status, so that everything was actually OK? At some point we may want to review what glidemon does, but in any case it flags those jobs as failed now, so it does have some more information than the exit code from the WN.

mmascher commented 9 years ago

Only a failure to look up the status; I did not check the transfer in ASO. Yes, I think glidemon reads the states from the node_state file, but gets the exit code from condor (so only the jobwrapper exit code).

belforte commented 9 years ago

Then it would be trivial for glidemon to set the exit code in this case to some non-zero value. Let's not worry about it now; I am only saying that at some point it can easily be done without changing other things.

On 01/08/2015 03:35 PM, Marco Mascheroni wrote:

Yes, I think glidemon read the states from the node_state file, but get the exit code from condor (so only the jobwrapper exit code)
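As a rough illustration of how that could look on the monitoring side: when the node state says the job failed but the exit code from condor is 0, substitute a non-zero value. This is only a sketch; the function name, the `"failed"` state string and the placeholder code are assumptions, not glidemon's actual code or the real node_state format.

```python
# Hypothetical placeholder value, not a real glidemon or CRAB exit code.
FAILED_WITHOUT_EXIT_CODE = 1


def effective_exit_code(node_state, condor_exit_code):
    """Return the exit code to display for a job.

    `node_state` is assumed to be a plain string such as "failed" or
    "finished"; `condor_exit_code` is the job wrapper exit code from condor.
    """
    if node_state == "failed" and condor_exit_code == 0:
        # The wrapper succeeded but the node failed later (e.g. in the postjob).
        return FAILED_WITHOUT_EXIT_CODE
    return condor_exit_code
```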

belforte commented 9 years ago

So chances are that crab report --dbs (or whatever the right syntax is) would show that task is >90% complete?

mmascher commented 9 years ago

yes, maybe. I have not tried. Will do it now.

mmascher commented 9 years ago

The user did not request publication. Here are his outputs: lcg-ls srm://dcache-se-cms.desy.de:8443/srm/managerv2?SFN=/pnfs/desy.de/cms/tier2/store/user/tpfotzer/SingleMu/crab_SingleMu_Run2012C-22Jan2013-v1/141219_163408, but I am too lazy to check the number of files transferred :)

belforte commented 9 years ago

Old ticket, possibly very glidemon-specific and/or covered by https://github.com/dmwm/CRABServer/issues/4615. Also, Andres has done work on postJob exit codes. I am closing; let's see if we still have status report problems.