Many "Has not updated the heartbeat in 300s!! It's possibly dead" warnings

ganga-devs / ganga

Ganga is an easy-to-use frontend for job definition and management

GNU General Public License v3.0

100 stars 159 forks source link

Many "Has not updated the heartbeat in 300s!! It's possibly dead" warnings #247

Closed olupton closed 8 years ago

olupton commented 8 years ago

Hi,

This is only minor, but I'm using the current bugfix/LHCbfixes branch (I need fixes to Tasks that haven't been released...) and I see the following warning rather frequently, particularly when I have a "large" (O(100s)) number of running Dirac jobs:

Ganga.Core.MonitoringComponent     : WARNING  Thread: MonitoringWorker_4_1456847097361 Has not updated the heartbeat in 300s!! It's possibly dead
Ganga.Core.MonitoringComponent     : WARNING  Thread is attempting to execute: _checkBackend
Ganga.Core.MonitoringComponent     : WARNING  Thread: MonitoringWorker_4_1456847097361 Has not updated the heartbeat in 300s!! It's possibly dead
Ganga.Core.MonitoringComponent     : WARNING  Thread is attempting to execute: _checkBackend
Ganga.Core.MonitoringComponent     : WARNING  Thread: MonitoringWorker_4_1456847097361 Has not updated the heartbeat in 300s!! It's possibly dead
Ganga.Core.MonitoringComponent     : WARNING  Thread is attempting to execute: _checkBackend

Let me know if any other information would be helpful.

Cheers, Olli

rob-c commented 8 years ago

Hi Olli, Are you observing either the Dirac monitoring lockup or any other job processes (tasks, queues) freeze?

If everything is running as expected this could just be that the timeout of 300sec if a bit too short due to long tasks running as part of the backend code.

I think I'll expand the debug information to contain more than _checkBackend as it's not certain which backend has locked up due to this warning, but I suspect it's the Dirac one.

Thanks,

Rob

olupton commented 8 years ago

Hi Rob,

I haven't noticed any significant freezes; just a lot of warnings.

The ganga session in question doesn't have any jobs that don't use the Dirac backend.

Cheers, Olli

rob-c commented 8 years ago

OK, in that case I'll work on improving the debugging later today as a start on this.

milliams commented 8 years ago

I've got some ideas for better heuristics on the monitoring heartbeat such as checking the thread current stack and seeing if it's hanging on the same line for a long period. This won't help in all cases but it will help reduce the false-positives.

egede commented 8 years ago

@rob-c Can you remind @olupton which config parameter can be used to extend the 300 s. At least in that way the verbosity of the warnings can be turned down.

rob-c commented 8 years ago

@milliams If you're certain that the heuristics don't get too complicated/require too much time to develop/debug I'd say great. But I'm more keen to have something fool proof here that works or doesn't.

@olupton Increasing HeartBeatTimeOut in .gangarc you can push back most of these annoying messages. https://github.com/ganga-devs/ganga/blob/develop/python/Ganga/Core/MonitoringComponent/Local_GangaMC_Service.py#L72

olupton commented 8 years ago

@rob-c Thanks; I tried doubling it (to 600s) and I still see the warning. It's hard to judge whether or not it is appearing less frequently.

I'm not sure if it's relevant, but the jobs that are running in that Ganga session are simple PyROOT ones, so if the "heartbeat" is something that comes from Gaudi/some application framework then it might really not be being updated.

rob-c commented 8 years ago

This could well be from a non-Dirac backend, I'm struggling to reproduce this locally from basic testing, I'll try more tomorrow.

rob-c commented 8 years ago

Do you know does this turn up if you're just submitting lots of jobs on the tasks system possibly?

olupton commented 8 years ago

There aren't any non-Dirac jobs in the repository, that I'm sure of.

At the moment I am seeing the warnings with O(40) 20-subjob jobs running. A new warning appears every O(10s). One thing that I hadn't noticed before is that they're all exactly the same (currently MonitoringWorker_0_1456871140871); I had thought the number was changing but this seems not to be the case.

drmarkwslater commented 8 years ago

Do we want to close this because of #269 being merged?

olupton commented 8 years ago

Just to say that I updated to the head of the bugfix/LHCbfixes branch a few minutes ago, and I am still seeing the error (probably because my config explicitly sets the timeout), but the error has changed a bit:

Ganga.Core.MonitoringComponent     : WARNING  Thread: MonitoringWorker_1_1457097322984 Has not updated the heartbeat in 600s!! It's possibly dead
Ganga.Core.MonitoringComponent     : WARNING  Thread is attempting to execute: unknown
Ganga.Core.MonitoringComponent     : WARNING  With arguments: ()

drmarkwslater commented 8 years ago

Yes indeed - if you're specifically setting the timeout then you won't have had these messages disappear. We do still need to investigate why they occur, but certainly at the moment they don't seem to be harmful.