dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

JobAccountant vocms0310.cern.ch went down twice Monday August 28 #8105

Closed scarletnorberg closed 6 years ago

scarletnorberg commented 7 years ago

https://its.cern.ch/jira/projects/CMSCOMPPR/issues/CMSCOMPPR-1218?filter=addedrecently

It went down twice today, very close together.

Here is the log:

<@---------- WMException End ----------@>
  File "/data/srv/wmagent/v1.1.4.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.4.patch2/lib/python2.7/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 179, in call
    self.algorithm(parameters)
  File "/data/srv/wmagent/v1.1.4.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.4.patch2/lib/python2.7/site-packages/WMComponent/JobAccountant/JobAccountantPoller.py", line 88, in algorithm
    raise JobAccountantPollerException(msg)
2017-08-28 03:45:45,632:140274881804032:INFO:Harness:>>>Terminating worker threads
2017-08-28 03:45:45,654:140274881804032:ERROR:BaseWorkerThread:Error in event loop (2): <WMComponent.JobAccountant.JobAccountantPoller.JobAccountantPoller instance at 0x7f944a815320>
<@========== WMException Start ==========@>
Exception Class: JobAccountantPollerException
Message: Hit general exception in JobAccountantPoller while using worker.
'utf8' codec can't decode byte 0xd0 in position 56427: invalid continuation byte
ModuleName : WMComponent.JobAccountant.JobAccountantPoller
MethodName : algorithm
ClassInstance : None
FileName : /data/srv/wmagent/v1.1.4.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.4.patch2/lib/python2.7/site-packages/WMComponent/JobAccountant/JobAccountantPoller.py
ClassName : None
LineNumber : 88
ErrorNr : 0

Traceback:
  File "/data/srv/wmagent/v1.1.4.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.4.patch2/lib/python2.7/site-packages/WMComponent/JobAccountant/JobAccountantPoller.py", line 68, in algorithm
    self.accountantWorker(jobsSlice)
  File "/data/srv/wmagent/v1.1.4.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.4.patch2/lib/python2.7/site-packages/WMComponent/JobAccountant/AccountantWorker.py", line 292, in call
    self.stateChanger.propagate(self.listOfJobsToFail, "jobfailed", "complete")
  File "/data/srv/wmagent/v1.1.4.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.4.patch2/lib/python2.7/site-packages/WMCore/JobStateMachine/ChangeState.py", line 181, in propagate
    self.recordInCouch(jobs, newstate, oldstate, updatesummary)
  File "/data/srv/wmagent/v1.1.4.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.4.patch2/lib/python2.7/site-packages/WMCore/JobStateMachine/ChangeState.py", line 452, in recordInCouch
    self.fwjrdatabase.commit(callback = discardConflictingDocument)
  File "/data/srv/wmagent/v1.1.4.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.4.patch2/lib/python2.7/site-packages/WMCore/Database/CMSCouch.py", line 281, in commit
    retval = self.post(uri, data)
  File "/data/srv/wmagent/v1.1.4.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.4.patch2/lib/python2.7/site-packages/WMCore/Services/Requests.py", line 121, in post
    encode, decode, contentType)
  File "/data/srv/wmagent/v1.1.4.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.4.patch2/lib/python2.7/site-packages/WMCore/Database/CMSCouch.py", line 120, in makeRequest
    encode, decode, contentType)
  File "/data/srv/wmagent/v1.1.4.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.4.patch2/lib/python2.7/site-packages/WMCore/Services/Requests.py", line 149, in makeRequest
    encoder, decoder, contentType)
  File "/data/srv/wmagent/v1.1.4.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.4.patch2/lib/python2.7/site-packages/WMCore/Services/Requests.py", line 229, in makeRequest_httplib
    encoded_data = self.encode(data)
  File "/data/srv/wmagent/v1.1.4.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.4.patch2/lib/python2.7/site-packages/WMCore/Services/Requests.py", line 563, in encode
    return encoder.encode(thunked)
  File "/data/srv/wmagent/v1.1.4.patch2/sw/slc6_amd64_gcc493/external/python/2.7.13/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/data/srv/wmagent/v1.1.4.patch2/sw/slc6_amd64_gcc493/external/python/2.7.13/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
<@---------- WMException End ----------@>

Backtrace:
  File "/data/srv/wmagent/v1.1.4.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.4.patch2/lib/python2.7/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 205, in call
    raise ex
2017-08-28 03:45:45,654:140274881804032:INFO:BaseWorkerThread:Worker thread <WMComponent.JobAccountant.JobAccountantPoller.JobAccountantPoller instance at 0x7f944a815320> terminated

ticoann commented 7 years ago

I patched the agent with #8058 and restarted the component. I think this is the same issue. Let me know if it still crashes.

amaltaro commented 7 years ago

Thanks for updating the twiki page too. If you agree, I'm in favor of patching agents as they hit this issue.

ticoann commented 7 years ago

Alan, yes, I agree with you. We will patch an agent when the problem hits it. We also need a better patch.

amaltaro commented 6 years ago

Also fixed by https://github.com/dmwm/WMCore/pull/8247
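
For anyone who lands here with the same traceback: the exception is raised while the job report documents are being JSON-encoded for CouchDB, because one string contains a byte (0xd0) that is not valid UTF-8. The sketch below is not the change made in #8058 or #8247. It only illustrates one possible workaround, assuming that lossy decoding of the offending bytes is acceptable. `sanitize` and `bad_report` are hypothetical names, and the code is written for Python 3 rather than the agent's Python 2.7.

```python
import json


def sanitize(obj, encoding="utf-8"):
    """Recursively decode byte strings, replacing undecodable bytes.

    Hypothetical helper for illustration only; not part of WMCore.
    """
    if isinstance(obj, bytes):
        # errors="replace" substitutes U+FFFD for each invalid byte
        # instead of raising UnicodeDecodeError.
        return obj.decode(encoding, errors="replace")
    if isinstance(obj, dict):
        return {sanitize(k, encoding): sanitize(v, encoding)
                for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [sanitize(item, encoding) for item in obj]
    return obj


# A made-up report whose error text contains a stray 0xd0 byte.
bad_report = {"jobid": 1, "errors": [b"stderr tail: \xd0 end"]}

# Strict decoding fails in the same way as in the log above.
try:
    bad_report["errors"][0].decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decode fails:", exc.reason)

# After sanitizing, the document can be serialized.
encoded = json.dumps(sanitize(bad_report))
print(encoded)
```

The trade-off is that the original bytes are lost: each invalid byte becomes the replacement character, so this suits log excerpts and error messages but not payloads that must round-trip exactly.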