Since central couch moved to VMs, we have been seeing ErrorHandler crash quite often, basically everywhere. We should improve its code so that it retries later instead of crashing when it has problems communicating with central couch (probably the ACDC database).
Just in case, this is the component traceback:
2014-11-14 12:35:42,918:INFO:ErrorHandlerPoller:Starting to build ACDC with 30 jobs
2014-11-14 12:35:42,918:INFO:ErrorHandlerPoller:This operation will take some time...
2014-11-14 12:36:54,728:ERROR:ErrorHandlerPoller:Caught exception in ErrorHandler
Traceback (most recent call last):
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 377, in algorithm
self.handleErrors()
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 311, in handleErrors
self.handleRetryDoneJobs(jobList)
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 269, in handleRetryDoneJobs
self.exhaustJobs(jobList)
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 134, in exhaustJobs
self.handleACDC(jobList)
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 194, in handleACDC
self.dataCollection.failedJobs(loadList)
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/Database/CouchUtils.py", line 52, in wrapper
return funcRef(x, *args, **opts)
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/ACDC/DataCollectionService.py", line 74, in failedJobs
job.get("owner", "cmsdataops"))
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/ACDC/CouchService.py", line 71, in newOwner
userInstance = makeUser(group, user, self.url, self.database)
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/GroupUser/User.py", line 93, in makeUser
group.connect()
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc5_amd64_gcc461/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/GroupUser/CouchObject.py", line 87, in connect
raise CouchConnectionError(msg)
CouchConnectionError
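A minimal sketch of the "retry later instead of crash" idea, not the actual ErrorHandlerPoller code. It assumes the fix can live around the `failedJobs` call in `handleACDC`: catch `CouchConnectionError`, log it, and leave the jobs to be retried on the next polling cycle. `CouchConnectionError` is redefined here as a stand-in so the example is self-contained, and `FlakyDataCollection`, `PollerSketch` and the in-memory `pending` list are hypothetical names used only for illustration.

```python
import logging


class CouchConnectionError(Exception):
    """Stand-in for the exception raised at the bottom of the traceback."""


class FlakyDataCollection(object):
    """Hypothetical stub: fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.uploaded = []

    def failedJobs(self, jobs):
        if self.failures > 0:
            self.failures -= 1
            raise CouchConnectionError("cannot reach central couch")
        self.uploaded.extend(jobs)


class PollerSketch(object):
    """Hypothetical poller that keeps jobs pending when couch is unreachable."""

    def __init__(self, dataCollection):
        self.dataCollection = dataCollection
        self.pending = []  # jobs still waiting to be uploaded to ACDC

    def handleACDC(self, jobList):
        # Queue the new jobs together with any left over from earlier cycles.
        self.pending.extend(jobList)
        try:
            self.dataCollection.failedJobs(self.pending)
        except CouchConnectionError as ex:
            # Do not let the exception propagate and kill the component;
            # the pending jobs are retried on the next polling cycle.
            logging.error("ACDC upload failed, will retry next cycle: %s", ex)
            return False
        self.pending = []
        return True
```

In the real component the jobs would presumably just stay in their current state in the agent database and be picked up again by the next cycle, rather than being held in memory as they are here.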