elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

[Heartbeat] Scheduler deadline exceeded message is confusing #23189

Open andrewvc opened 3 years ago

andrewvc commented 3 years ago

The warning for tasks that miss their deadline, emitted at https://github.com/elastic/beats/blob/master/heartbeat/scheduler/scheduler.go#L152, is confusing. It currently reads:

"%d tasks have missed their schedule deadlines in the last %s."

It's really unclear to users what's going on here (to the point that I'm labeling this a bug). We should make it more friendly, something like:

%d tasks are running behind schedule (previous run not finished when next one is already due to run).

We also should document troubleshooting this somewhere. There are a number of different potential causes, and it's too much to put in an error message.

  1. Constrained scheduler limits for too many monitors (if you have 1000 monitors that each take a second to execute on a 30s interval, and a schedule that constrains us to execute at most 2 at a time, after 60s only 120 monitors will have run, causing the rest to be behind).
  2. Heartbeat is actually resource constrained (same as above, but we're just hitting real limits, not artificial ones)
  3. A timeout value exceeding the schedule interval (if you check a resource every 5s with the default timeout of 16s, and the check takes 10s to run, we'll miss a deadline since we don't overlap checks of the same monitor)

In the case of the last point, I'm also wondering if we should just suppress the message, because it's very likely you'll hit this state, but it won't actually be an error.

Additionally, we should consider listing the specific monitors that are in this state to help users debug this issue.
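
One possible shape for a per-monitor message is sketched below. Everything here is an assumption for illustration: `missedTracker`, `record`, and `summary` are hypothetical names that do not exist in the Heartbeat scheduler, and the message wording is only a variation on the proposal above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// missedTracker counts missed deadlines per monitor ID.
// Hypothetical type for illustration; it is not safe for concurrent use.
type missedTracker struct {
	counts map[string]int
}

func newMissedTracker() *missedTracker {
	return &missedTracker{counts: map[string]int{}}
}

// record notes that the monitor with the given ID missed one deadline.
func (t *missedTracker) record(monitorID string) {
	t.counts[monitorID]++
}

// summary returns a warning that names each late monitor, sorted by ID,
// with its missed-deadline count. It returns "" when nothing was recorded.
func (t *missedTracker) summary() string {
	if len(t.counts) == 0 {
		return ""
	}
	ids := make([]string, 0, len(t.counts))
	for id := range t.counts {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	parts := make([]string, 0, len(ids))
	for _, id := range ids {
		parts = append(parts, fmt.Sprintf("%s (%d)", id, t.counts[id]))
	}
	return fmt.Sprintf("%d monitors are running behind schedule: %s",
		len(ids), strings.Join(parts, ", "))
}

func main() {
	t := newMissedTracker()
	t.record("http-frontend")
	t.record("tcp-db")
	t.record("http-frontend")
	fmt.Println(t.summary())
}
```

Sorting the IDs keeps the log line stable between reporting periods, which makes it easier to spot when the set of late monitors changes.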

botelastic[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

TheRiffRafi commented 1 year ago

Hello Team,

I am reopening this as it is still needed. We continue to receive reports of this type of error, and we aren't sure how to proceed, as there are multiple reasons that could cause this message.

botelastic[bot] commented 2 months ago

Hi! We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:. Thank you for your contribution!