dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Improve error handling for ErrorHandler :-) #5470

Closed amaltaro closed 9 years ago

amaltaro commented 10 years ago

Since central couch moved to VMs, we have been seeing ErrorHandler crash quite often, basically everywhere. We should improve its code so that it retries later instead of crashing when there are problems communicating with central couch (probably the ACDC database).
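To illustrate the "retry later instead of crash" idea, here is a minimal sketch. It is not the actual ErrorHandlerPoller code: `run_cycle`, `upload_failed_jobs` and `pending` are hypothetical names, and `CouchConnectionError` is redefined locally as a stand-in for the exception seen in the traceback. The assumption is that a failed upload can simply be kept around and attempted again on the next polling cycle.

```python
import logging


class CouchConnectionError(Exception):
    """Local stand-in for the exception in the traceback (not imported from WMCore)."""


def run_cycle(job_list, upload_failed_jobs, pending):
    """Upload failed jobs, deferring them to the next cycle on connection errors.

    job_list: jobs found in this polling cycle.
    upload_failed_jobs: hypothetical callable that talks to couch.
    pending: list of jobs left over from earlier failed cycles (mutated in place).
    Returns True if the upload succeeded, False if it was deferred.
    """
    # Retry anything left over from earlier cycles together with the new jobs.
    batch = pending + job_list
    del pending[:]
    if not batch:
        return True
    try:
        upload_failed_jobs(batch)
    except CouchConnectionError as ex:
        # Do not let the component crash: log, remember the jobs, try again later.
        logging.error("Could not reach couch, will retry next cycle: %s", ex)
        pending.extend(batch)
        return False
    return True
```

With something like this, a temporary couch outage would only delay the ACDC upload by one or more polling cycles instead of taking the component down.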

Just in case, this is the component traceback:

2014-11-14 12:35:42,918:INFO:ErrorHandlerPoller:Starting to build ACDC with 30 jobs
2014-11-14 12:35:42,918:INFO:ErrorHandlerPoller:This operation will take some time...
2014-11-14 12:36:54,728:ERROR:ErrorHandlerPoller:Caught exception in ErrorHandler
Traceback (most recent call last):
  File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 377, in algorithm
    self.handleErrors()
  File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 311, in handleErrors
    self.handleRetryDoneJobs(jobList)
  File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 269, in handleRetryDoneJobs
    self.exhaustJobs(jobList)
  File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 134, in exhaustJobs
    self.handleACDC(jobList)
  File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 194, in handleACDC
    self.dataCollection.failedJobs(loadList)
  File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/Database/CouchUtils.py", line 52, in wrapper
    return funcRef(x, *args, **opts)
  File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/ACDC/DataCollectionService.py", line 74, in failedJobs
    job.get("owner", "cmsdataops"))
  File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/ACDC/CouchService.py", line 71, in newOwner
    userInstance = makeUser(group, user, self.url, self.database)
  File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/GroupUser/User.py", line 93, in makeUser
    group.connect()
  File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/GroupUser/CouchObject.py", line 87, in connect
    raise CouchConnectionError(msg)
CouchConnectionError

amaltaro commented 9 years ago

Fixed a long time ago by #5313. Closing this issue.