dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0
46 stars 107 forks source link

HeartbeatMonitor global workqueue cherrypy thread crashing with timeout #9669

Open amaltaro opened 4 years ago

amaltaro commented 4 years ago

Impact of the bug Global Workqueue and its monitoring in MonIT

Describe the bug Same introduction as of https://github.com/dmwm/WMCore/issues/9668

With the huge amount of documents in GQ, it turns out that the way we monitor global workqueue activity is pretty bad, since it has to pull all the documents from CouchDB, parse all of them and come up with nice metrics to be published to MonIT.

The problem is, right now, those >300k documents are resulting in almost 0.5GB of data that has to be pulled down to the Global Workqueue backend. Which, of course, doesn't manage to go through within 270secs (frontends limit it to 300secs).

How to reproduce it Creating hundreds of thousands of GQEs(?)

Expected behavior The obvious one, the HeartbeatMonitor cherrypy thread should not crash. We definitely need to revisit this logic as well, rely more on CouchDB views instead of fetching the whole documents. I still don't have a solution in mind, but we need to make it a lightweight monitoring service.

Additional context and error message Whole traceback from heartbeatMonitor-vocms0740-20200429.log

2020-04-29 17:42:41,680:INFO:HeartbeatMonitorBase:Checking Thread status...
2020-04-29 17:42:41,740:INFO:HeartbeatMonitor:Collecting GlobalWorkqueue statistics...
2020-04-29 17:42:41,742:INFO:LogDB:<LogDB(url=https://cmsweb.cern.ch/couchdb/wmstats_logdb, identifier=global_workqueue, agent=1)>
2020-04-29 17:47:12,198:ERROR:CherryPyPeriodicTask:Periodic Thread ERROR pycurl.error (28, 'Operation timed out after 270000 milliseconds with 453095385 bytes received')
2020-04-29 17:47:12,198:ERROR:CherryPyPeriodicTask: Traceback (most recent call last):
2020-04-29 17:47:12,198:ERROR:CherryPyPeriodicTask:   File "/data/srv/beHG2004f/sw/slc7_amd64_gcc630/cms/workqueue/1.3.2.pre5/lib/python2.7/site-packages/WMCore/REST/CherryPyPeriodicTask.py", line 91, in run
2020-04-29 17:47:12,198:ERROR:CherryPyPeriodicTask:     self.taskFunc(self.config)
2020-04-29 17:47:12,199:ERROR:CherryPyPeriodicTask:   File "/data/srv/beHG2004f/sw/slc7_amd64_gcc630/cms/workqueue/1.3.2.pre5/lib/python2.7/site-packages/WMCore/REST/HeartbeatMonitorBase.py", line 36, in reportToWMStats
2020-04-29 17:47:12,199:ERROR:CherryPyPeriodicTask:     monitorInfo = self.addAdditionalMonitorReport(config)
2020-04-29 17:47:12,199:ERROR:CherryPyPeriodicTask:   File "/data/srv/beHG2004f/sw/slc7_amd64_gcc630/cms/workqueue/1.3.2.pre5/lib/python2.7/site-packages/WMCore/GlobalWorkQueue/CherryPyThreads/HeartbeatMonitor.py", line 23, in addAdditionalMonitorReport
2020-04-29 17:47:12,199:ERROR:CherryPyPeriodicTask:     results = globalQ.monitorWorkQueue(self.initialStatus)
2020-04-29 17:47:12,199:ERROR:CherryPyPeriodicTask:   File "/data/srv/beHG2004f/sw/slc7_amd64_gcc630/cms/workqueue/1.3.2.pre5/lib/python2.7/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1178, in monitorWorkQueue
2020-04-29 17:47:12,199:ERROR:CherryPyPeriodicTask:     elements = self.workqueueDS.getElementsByStatus(status)
2020-04-29 17:47:12,199:ERROR:CherryPyPeriodicTask:   File "/data/srv/beHG2004f/sw/slc7_amd64_gcc630/cms/workqueue/1.3.2.pre5/lib/python2.7/site-packages/WMCore/Services/WorkQueue/WorkQueue.py", line 389, in getElementsByStatus
2020-04-29 17:47:12,199:ERROR:CherryPyPeriodicTask:     data = db.loadView('WorkQueue', 'elementsByStatus', options, status)
2020-04-29 17:47:12,199:ERROR:CherryPyPeriodicTask:   File "/data/srv/beHG2004f/sw/slc7_amd64_gcc630/cms/workqueue/1.3.2.pre5/lib/python2.7/site-packages/WMCore/Database/CMSCouch.py", line 531, in loadView
2020-04-29 17:47:12,199:ERROR:CherryPyPeriodicTask:     (self.name, design, view, data), {'keys': keys})
2020-04-29 17:47:12,200:ERROR:CherryPyPeriodicTask:   File "/data/srv/beHG2004f/sw/slc7_amd64_gcc630/cms/workqueue/1.3.2.pre5/lib/python2.7/site-packages/WMCore/Services/Requests.py", line 135, in post
2020-04-29 17:47:12,200:ERROR:CherryPyPeriodicTask:     encode, decode, contentType)
2020-04-29 17:47:12,200:ERROR:CherryPyPeriodicTask:   File "/data/srv/beHG2004f/sw/slc7_amd64_gcc630/cms/workqueue/1.3.2.pre5/lib/python2.7/site-packages/WMCore/Database/CMSCouch.py", line 122, in makeRequest
2020-04-29 17:47:12,200:ERROR:CherryPyPeriodicTask:     encode, decode, contentType)
2020-04-29 17:47:12,200:ERROR:CherryPyPeriodicTask:   File "/data/srv/beHG2004f/sw/slc7_amd64_gcc630/cms/workqueue/1.3.2.pre5/lib/python2.7/site-packages/WMCore/Services/Requests.py", line 165, in makeRequest
2020-04-29 17:47:12,200:ERROR:CherryPyPeriodicTask:     result, response = self.makeRequest_pycurl(uri, data, verb, headers)
2020-04-29 17:47:12,200:ERROR:CherryPyPeriodicTask:   File "/data/srv/beHG2004f/sw/slc7_amd64_gcc630/cms/workqueue/1.3.2.pre5/lib/python2.7/site-packages/WMCore/Services/Requests.py", line 183, in makeRequest_pycurl
2020-04-29 17:47:12,200:ERROR:CherryPyPeriodicTask:     ckey=ckey, cert=cert, capath=capath)
2020-04-29 17:47:12,200:ERROR:CherryPyPeriodicTask:   File "/data/srv/beHG2004f/sw/slc7_amd64_gcc630/cms/workqueue/1.3.2.pre5/lib/python2.7/site-packages/WMCore/Services/pycurl_manager.py", line 257, in request
2020-04-29 17:47:12,200:ERROR:CherryPyPeriodicTask:     curl.perform()
2020-04-29 17:47:12,200:ERROR:CherryPyPeriodicTask: error: (28, 'Operation timed out after 270000 milliseconds with 453095385 bytes received')
amaltaro commented 4 years ago

For the record, after talking to Imran, I have temporarily increased the polling cycle of this thread from 10min to actually 120min (vocms0740)