mmascher: I tried this patch over the 0.7.12 tag but, as you can see in the log, the WorkQueueManagerWorkPoller still terminates, and no more work is taken from the global workqueue afterwards.
{{{
2011-07-26 04:49:13,346:INFO:WorkQueueManagerWMBSFileFeeder:0 of units of work acquired for file creation
2011-07-26 04:49:21,484:ERROR:WorkQueue:408 Request Timeout occured, Investigate: HTP/0 K ontent-Length: 2^M Expires: Tue, 26 Jul 2011 02:54:11 GMT^M Server: CherryPy/3.1.2^M Etag: 15744094587142094^M Date: Tue, 26 Jul 2011 02:49:11 GMT^M Content-Type: application/json^M ^M {}HTTP/1.1 408 Request Timeout^M Content-Length: 0^M Content-Type: text/plain^M ^M
cannot parse JSON description: HTP/0 K ontent-Length: 2^M Expires: Tue, 26 Jul 2011 02:54:11 GMT^M Server: CherryPy/3.1.2^M Etag: 15744094587142094^M Date: Tue, 26 Jul 2011 02:49:11 GMT^M Content-Type: application/json^M ^M
{}HTTP/1.1 408 Request Timeout^M Content-Length: 0^M Content-Type: text/plain^M ^M
2011-07-26 04:49:21,497:ERROR:BaseWorkerThread:Error in worker algorithm (1): Backtrace: <WMComponent.WorkQueueManager.WorkQueueManagerWorkPoller.WorkQueueManagerWorkPoller instance at 0xdd3ab48> cannot parse JSON description: T1.1 20O^MC
  File "/data/MarcoAndrea/WMCore/0.7.12/src/python/WMCore/WorkerThreads/BaseWorkerThread.py", line 168, in __call__
    self.algorithm(parameters)
  File "/data/MarcoAndrea/WMCore/0.7.12/src/python/WMComponent/WorkQueueManager/WorkQueueManagerWorkPoller.py", line 35, in algorithm
    work = self.queue.pullWork()
  File "/data/MarcoAndrea/WMCore/0.7.12/src/python/WMCore/WorkQueue/WorkQueue.py", line 791, in pullWork
    team)
  File "/data/MarcoAndrea/WMCore/0.7.12/src/python/WMCore/Services/WorkQueue/WorkQueue.py", line 89, in getWork
    return self._getResult(callname, args = args, verb = "POST")
  File "/data/MarcoAndrea/WMCore/0.7.12/src/python/WMCore/Services/WorkQueue/WorkQueue.py", line 62, in _getResult
    result = self.decoder(result)
  File "/data/MarcoAndrea/WMCore/0.7.12/src/python/WMCore/Services/WorkQueue/WorkQueue.py", line 36, in jsonThunkerDecoder
    return thunker.unthunk(JsonWrapper.loads(data))
  File "/data/MarcoAndrea/WMCore/0.7.12/src/python/WMCore/Wrappers/JsonWrapper/__init__.py", line 40, in loads
    raise ex
2011-07-26 04:49:21,501:ERROR:BaseWorkerThread:Error in event loop (2): <WMComponent.WorkQueueManager.WorkQueueManagerWorkPoller.WorkQueueManagerWorkPoller instance at 0xdd3ab48> cannot parse JSON description: T1.1 20O^MC Backtrace:
  File "/data/MarcoAndrea/WMCore/0.7.12/src/python/WMCore/WorkerThreads/BaseWorkerThread.py", line 192, in __call__
    raise ex
2011-07-26 04:49:21,501:INFO:BaseWorkerThread:Worker thread <WMComponent.WorkQueueManager.WorkQueueManagerWorkPoller.WorkQueueManagerWorkPoller instance at 0xdd3ab48> terminated
}}}
swakef: I guess the problem is either the workqueue REST server not correctly returning a response, or httplib2 getting confused. httplib2 has had issues with 408 codes before that we have had to work around, so it's possible there is still an edge case we haven't caught.
The other disturbing bit is {{{cannot parse JSON description: T1.1 20O^MC}}}, which looks like it's trying to decode the HTTP response including the headers.
In short this seems to be quite messed up.
If it's the server sending a dodgy reply, that will go away with 08x. Can you check the WorkQueueService logs, particularly the stderr one, to see whether it saw an error?
Also, the httplib2 cache files should have been deleted, but it's worth checking WorkQueueManager/.cache (I think it might then be .wmcore_cache under there) for a cache file for that response. (You might need to grep on the URL to find a file containing it.)
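For reference, a throwaway script along these lines could hunt for such a cache file; the cache path and the URL fragment are only guesses based on the comment above, not definitive locations:
{{{
# Rough helper to look for cached responses containing the workqueue URL.
# The cache directory name and the URL fragment are assumptions taken from
# the discussion above; adjust them to the actual install.
import os

cache_dir = "WorkQueueManager/.wmcore_cache"
needle = "workqueueservice/workqueue"

for root, _dirs, files in os.walk(cache_dir):
    for name in files:
        path = os.path.join(root, name)
        with open(path) as handle:
            if needle in handle.read():
                print("cached response found in %s" % path)
}}}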
Lastly, we need to stop the thread from crashing when it hits these errors; the hope would then be that at least further polling cycles would succeed (and if not, at least we get something reproducible to debug). I'll prepare a patch for both 07x and 08x that does that.
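A minimal sketch of what such a patch could look like, assuming the algorithm()/pullWork() structure shown in the traceback above; this is illustration only, not the actual WMCore change:
{{{
# Minimal sketch of keeping the poller alive, not the actual WMCore patch.
# algorithm() and self.queue.pullWork() appear in the traceback above; the
# logging calls and exact error handling here are assumptions.
import logging

def algorithm(self, parameters=None):
    """Pull work from the global queue; swallow transient errors so the
    worker thread survives and simply retries on the next polling cycle."""
    try:
        work = self.queue.pullWork()
        logging.info("pullWork returned: %s", work)
    except Exception as ex:
        # A corrupt response (like the stray 408 above) surfaces here as a
        # JSON decode error; log it and wait for the next cycle instead of
        # letting BaseWorkerThread terminate the thread.
        logging.error("error pulling work, will retry next cycle: %s", str(ex))
}}}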
Thinking about this some more, I think we might be hitting an issue with threads. I seem to remember that when changing the code in 08x I moved from sharing a workqueue object to creating a separate one per thread. That would explain why the responses seem mixed up. This is probably patchable; I'll take a look, but I'd suggest thinking about moving to 08x.
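To illustrate the suspected failure mode in isolation (this is generic httplib2 usage, not WMCore code): an httplib2.Http object is not thread-safe, so two threads sharing one instance can read each other's responses off the same socket, while keeping one instance per thread avoids the mix-up:
{{{
# Generic illustration of the suspected problem, not WMCore code: a shared
# httplib2.Http instance can interleave responses between threads, so each
# thread gets its own connection via threading.local. The cache directory
# name below is an assumption.
import threading
import httplib2

_local = threading.local()

def get_http():
    """Return an httplib2.Http connection private to the calling thread."""
    if not hasattr(_local, "http"):
        _local.http = httplib2.Http(".wmcore_cache")
    return _local.http

def fetch(url):
    # Each thread uses its own connection, so responses cannot get mixed up.
    response, content = get_http().request(url, "GET")
    return response.status, content
}}}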
mmascher: I checked the stderr of the WorkQueueService and I did not see anything. In the cache directory I see some empty files; some of them are called [...]_POST_getwork, where getwork is the API that was called.
When I restarted the local workqueue everything was fine, and the request was digested by the JobCreator. Therefore, I think that some fallback try/except which prevents the thread from crashing should be enough until we switch to 08x. Then we will see if that happens again :)
ewv: I ran into this issue too and Seangchan's patch wasn't quite enough to fix it since it wasn't always finding the "408" string in the corrupted I/O. Removing that check and just retrying on any corrupt JSON did seem to fix it for me.
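Roughly, the change ewv describes amounts to something like the sketch below; the function name, retry count, and delay are illustrative, not the actual patch:
{{{
# Hedged sketch of the retry-on-any-corrupt-JSON idea described above; the
# function and its parameters are illustrative, not the actual WMCore patch.
import json
import time

def decode_with_retry(fetch, retries=2, delay=5):
    """fetch() re-issues the HTTP call and returns the raw body. Retry the
    whole request whenever the body fails to parse as JSON, rather than only
    when the string '408' happens to appear in the corrupted payload."""
    for attempt in range(retries + 1):
        data = fetch()
        try:
            return json.loads(data)
        except ValueError:
            if attempt == retries:
                raise
            time.sleep(delay)  # give the server a moment before re-requesting
}}}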
sryu: Replying to [comment:2 swakef]:
I think Stuart is right. I tested individual calls and never had the problem so far, and when it happens it is always a mix-up between the synchronize call's headers and the getWork call's results. Probably they are sharing the object and the two streams got mixed together.
Sometimes the getWork call has a trailing 408 error message appended to an otherwise valid JSON result, which causes the thread termination. I suspect the server response somehow got mixed with a consecutive synchronize call, which has legitimate timeout errors. Here is an example of the faulty response; a sketch of one possible workaround follows it. I attached a temporary fix until I find the real cause. Steve, please review. This is only for the 0_7_x branch.
{{{
MCTest_110720_020855", "mask_url": "http://vocms144.cern.ch:9996/workqueueservice/workqueue/wf/52204.mask", "wmtask_type": "Production", "element_id": 52204, "wmtask_name": "Production"}, {"dbs_url": null, "owner": "None/cmsdataops/cmsdataops/cmsdataops/cmsdataops", "team_name": "mc", "id": 2138, "request_name": "cmsdataops_Backfill_110716_03_T1_US_FNAL_MCTest_110720_020855", "url": "http://vocms144.cern.ch:9996/workqueueservice/workqueue/wf/cmsdataops_Backfill_110716_03_T1_US_FNAL_MCTest_110720_020855.pkl", "wmspec_name": "cmsdataops_Backfill_110716_03_T1_US_FNAL_MCTest_110720_020855", "mask_url": "http://vocms144.cern.ch:9996/workqueueservice/workqueue/wf/52205.mask", "wmtask_type": "Production", "element_id": 52205, "wmtask_name": "Production"}]HTTP/1.1 408 Request Timeout^M Content-Length: 0^M Content-Type: text/plain^M
}}}
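The attached fix is not reproduced here, but as a rough sketch, assuming the junk is always appended after a complete JSON body (as in the example above), a workaround could trim from the embedded status line and re-parse:
{{{
# Sketch only; the attached temporary fix is not shown in this ticket text.
# This salvages a response like the example above, where a valid JSON body
# has a stray "HTTP/1.1 408 Request Timeout" header block glued onto the end.
import json

def parse_possibly_tainted(body):
    """Try a normal parse first; if that fails, cut everything from the first
    embedded 408 status line onwards and try again."""
    try:
        return json.loads(body)
    except ValueError:
        marker = body.find("HTTP/1.1 408")
        if marker > 0:
            return json.loads(body[:marker])
        raise
}}}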