Open mikecote opened 1 month ago
Pinging @elastic/response-ops (Team:ResponseOps)
After prototyping this (https://github.com/elastic/kibana/pull/190093), I noticed that tasks sometimes run more frequently than scheduled, but the overall rate is much closer to the set number of tasks per minute.
Today, recurring tasks are rescheduled based on when they started running. This approach helps spread tasks over time when tasks are heavily delayed, so the load on the system is more constant instead of sporadic. It also avoids repeated thundering herds when enabling many rules simultaneously or when Kibana has been down for a period of time.
In the image below, you can see this in action: downtime causes a series of tasks to queue up the next time Kibana runs, but over time the number of requests starts to even out.
However, using `startedAt` in the calculation always causes some delay between each execution because of the time it takes for the system to start running a task after it's been due to run. For example, a task running every `1m` would normally run every ~`1m 3s` because the default poll interval of `3s` contributes to the delay before tasks start running.

To preserve task spreading when the system is overloaded while fixing the minor delays added after every run, we should use the `runAt` whenever the task runs within `10s` of when it was due. This way, in normal scenarios, the tasks run as close to their schedule as possible.

A code sample can be found in the file `task_runner.ts` (https://github.com/elastic/kibana/pull/186972). We would need to add logic to apply this calculation only if the gap between `runAt` and `startedAt` is less than 10s.

Definition of Done
- If `startedAt - runAt` is greater than 10s, use the same calculation as exists today
- If `startedAt - runAt` is less than 10s, use the new calculation based on `runAt` instead
- Ensure the next run is never scheduled in the past (`Math.max(..., Date.now());`)
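
To make the proposed rule concrete, here is a minimal sketch of how the next run time could be computed. It is not the code from either linked PR: `getNextRunAt`, `RunTimes`, `RUN_AT_THRESHOLD_MS`, and the injectable `now` parameter are hypothetical names introduced for illustration. The issue does not say what should happen when the delay is exactly 10s, so the sketch assumes that case keeps today's `startedAt`-based behavior.

```typescript
// Hypothetical sketch; names and structure are assumptions, not Kibana's task_runner.ts.
const RUN_AT_THRESHOLD_MS = 10 * 1000; // the 10s window described in the issue

interface RunTimes {
  runAt: Date; // when the task was due to run
  startedAt: Date; // when the task actually started running
}

function getNextRunAt(
  { runAt, startedAt }: RunTimes,
  intervalMs: number,
  now: number = Date.now() // injectable so the example below is deterministic
): Date {
  const delayMs = startedAt.getTime() - runAt.getTime();

  // Small delay: anchor on runAt so poll latency does not accumulate run after run.
  // Large delay (assumed to include exactly 10s): anchor on startedAt, as today,
  // so overdue tasks stay spread out over time.
  const base = delayMs < RUN_AT_THRESHOLD_MS ? runAt.getTime() : startedAt.getTime();

  // Clamp so the next run is never scheduled in the past.
  return new Date(Math.max(base + intervalMs, now));
}

// Usage: a 1m task picked up 3s after it was due.
const due = new Date('2024-08-01T00:00:00.000Z');
const started = new Date(due.getTime() + 3000);
const next = getNextRunAt({ runAt: due, startedAt: started }, 60000, started.getTime());
console.log(next.toISOString()); // 2024-08-01T00:01:00.000Z
```

In this example the next run is due 57s after the previous run started, which may be one reason tasks can appear to run more frequently than scheduled, as noted in the comment about the prototype.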