elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Task Manager] Improve scheduling of recurring tasks so they re-run more on-time #189114

Open mikecote opened 1 month ago

mikecote commented 1 month ago

Today, recurring tasks are rescheduled based on when they started running. This approach helps spread tasks over time when there is a large amount of task delay so the load onto the system is more constant instead of sporadic. It also avoids constant thundering herds when enabling many rules simultaneously or when Kibana has been down for a period of time.

The image below shows this in action: downtime causes a series of tasks to queue up the next time Kibana runs, but over time the number of requests starts to even out in the middle.

[image]

However, using startedAt in the calculation always adds some delay between executions because of the time it takes the system to start running a task after it becomes due. For example, a task scheduled to run every 1m would normally run every ~1m 3s, because the default poll interval of 3s contributes to the delay before tasks start running.
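To make the drift concrete, here is a minimal simulation of rescheduling from startedAt. It is a simplified model rather than Task Manager code: the function name is hypothetical, and the pickup delay is assumed to be a constant 3s (the worst case for a 3s poll interval), whereas real delays vary between runs.

```typescript
// Simplified model, not Task Manager code.
// Assumption: every run is picked up a fixed 3s after it becomes due.
const INTERVAL_MS = 60_000; // task schedule: every 1m
const PICKUP_DELAY_MS = 3_000; // assumed constant delay from polling

// Returns the start time (ms since the first due time) of each run
// when the next runAt is computed from startedAt.
function simulateStartedAtScheduling(runs: number): number[] {
  const starts: number[] = [];
  let runAt = 0;
  for (let i = 0; i < runs; i++) {
    const startedAt = runAt + PICKUP_DELAY_MS;
    starts.push(startedAt);
    // Rescheduling from startedAt carries the pickup delay into every cycle.
    runAt = startedAt + INTERVAL_MS;
  }
  return starts;
}

const starts = simulateStartedAtScheduling(3);
// Gap between consecutive starts is interval + delay (63s), not 60s.
console.log(starts[1] - starts[0]);
```

Under these assumptions the task falls one further pickup delay behind an ideal 1m schedule on every run, which is why the observed frequency ends up lower than the configured one.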

To preserve task spreading when the system is overloaded while fixing the minor delays added after every run, we should calculate the next run from runAt instead of startedAt whenever the task starts within 10s of when it was due. This way, in normal scenarios, tasks run as close to their schedule as possible.

A code sample can be found in task_runner.ts in https://github.com/elastic/kibana/pull/186972. We would need to add logic to apply this calculation only if the gap between runAt and startedAt is less than 10s.
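The proposed rule could be sketched roughly as follows. This is an illustration only and not the code from the linked PR: the `TaskTiming` shape, the `getNextRunAt` name, and the `ON_TIME_THRESHOLD_MS` constant are hypothetical, and the interval is assumed to be already parsed into milliseconds.

```typescript
// Illustrative sketch; names and types are hypothetical, not Task Manager's API.
const ON_TIME_THRESHOLD_MS = 10_000; // the 10s threshold proposed above

interface TaskTiming {
  runAt: Date; // when the task was due
  startedAt: Date; // when the task actually started running
  intervalMs: number; // schedule interval, assumed pre-parsed to ms
}

function getNextRunAt({ runAt, startedAt, intervalMs }: TaskTiming): Date {
  const startDelayMs = startedAt.getTime() - runAt.getTime();
  // On time (delay under 10s): anchor to runAt so small delays don't accumulate.
  // Late (delay of 10s or more): anchor to startedAt to keep spreading the load.
  const base = startDelayMs < ON_TIME_THRESHOLD_MS ? runAt : startedAt;
  return new Date(base.getTime() + intervalMs);
}

// Started 3s late: next run stays on the original 1m cadence.
const onTime = getNextRunAt({
  runAt: new Date(0),
  startedAt: new Date(3_000),
  intervalMs: 60_000,
});
console.log(onTime.getTime()); // 60000

// Started 15s late: next run is pushed out relative to the actual start.
const late = getNextRunAt({
  runAt: new Date(0),
  startedAt: new Date(15_000),
  intervalMs: 60_000,
});
console.log(late.getTime()); // 75000
```

The sketch does not cover the case where the task runs longer than its interval, so that runAt plus the interval is already in the past. How that should be handled is left open here.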

Definition of Done

elasticmachine commented 1 month ago

Pinging @elastic/response-ops (Team:ResponseOps)

mikecote commented 1 month ago

After prototyping this (https://github.com/elastic/kibana/pull/190093), I did notice that tasks sometimes run more frequently than scheduled, but the result is much closer to the set number of tasks per minute.