Closed olupton closed 8 years ago
Hi Olli, Are you observing either the Dirac monitoring lockup or any other job processes (tasks, queues) freeze?
If everything is running as expected this could just be that the timeout of 300sec if a bit too short due to long tasks running as part of the backend code.
I think I'll expand the debug information to contain more than _checkBackend
as it's not certain which backend has locked up due to this warning, but I suspect it's the Dirac one.
Thanks,
Rob
Hi Rob,
I haven't noticed any significant freezes; just a lot of warnings.
The ganga session in question doesn't have any jobs that don't use the Dirac backend.
Cheers, Olli
OK, in that case I'll work on improving the debugging later today as a start on this.
I've got some ideas for better heuristics on the monitoring heartbeat such as checking the thread current stack and seeing if it's hanging on the same line for a long period. This won't help in all cases but it will help reduce the false-positives.
@rob-c Can you remind @olupton which config parameter can be used to extend the 300 s. At least in that way the verbosity of the warnings can be turned down.
@milliams If you're certain that the heuristics don't get too complicated/require too much time to develop/debug I'd say great. But I'm more keen to have something fool proof here that works or doesn't.
@olupton Increasing HeartBeatTimeOut
in .gangarc you can push back most of these annoying messages. https://github.com/ganga-devs/ganga/blob/develop/python/Ganga/Core/MonitoringComponent/Local_GangaMC_Service.py#L72
@rob-c Thanks; I tried doubling it (to 600s) and I still see the warning. It's hard to judge whether or not it is appearing less frequently.
I'm not sure if it's relevant, but the jobs that are running in that Ganga session are simple PyROOT ones, so if the "heartbeat" is something that comes from Gaudi/some application framework then it might really not be being updated.
This could well be from a non-Dirac backend, I'm struggling to reproduce this locally from basic testing, I'll try more tomorrow.
Do you know does this turn up if you're just submitting lots of jobs on the tasks system possibly?
There aren't any non-Dirac jobs in the repository, that I'm sure of.
At the moment I am seeing the warnings with O(40) 20-subjob jobs running. A new warning appears every O(10s). One thing that I hadn't noticed before is that they're all exactly the same (currently MonitoringWorker_0_1456871140871
); I had thought the number was changing but this seems not to be the case.
Do we want to close this because of #269 being merged?
Just to say that I updated to the head of the bugfix/LHCbfixes branch a few minutes ago, and I am still seeing the error (probably because my config explicitly sets the timeout), but the error has changed a bit:
Ganga.Core.MonitoringComponent : WARNING Thread: MonitoringWorker_1_1457097322984 Has not updated the heartbeat in 600s!! It's possibly dead
Ganga.Core.MonitoringComponent : WARNING Thread is attempting to execute: unknown
Ganga.Core.MonitoringComponent : WARNING With arguments: ()
Yes indeed - if you're specifically setting the timeout then you won't have had these messages disappear. We do still need to investigate why they occur, but certainly at the moment they don't seem to be harmful.
Hi,
This is only minor, but I'm using the current bugfix/LHCbfixes branch (I need fixes to Tasks that haven't been released...) and I see the following warning rather frequently, particularly when I have a "large" (O(100s)) number of running Dirac jobs:
Let me know if any other information would be helpful.
Cheers, Olli