Open pmuellr opened 3 years ago
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
Realizing there may be some other related issues here, so I'm just going to append them here; we can split them out into separate issues later if needed.
issue: task manager health status api does not indicate cancelled tasks

There's a place for it, I think, under `stats.runtime.value.execution.result_frequency_percent_as_number`, where you see the following objects, per task type:
```json
{
  "Success": 100,
  "RetryScheduled": 0,
  "Failed": 0,
  "status": "OK"
}
```
For the tasks that were cancelled, I can still see execution times in the health stats that indicate they were very likely cancelled (the execution time just exceeds the timeout), but the frequency numbers in the object still show `Success: 100`. Seems like we could add a `Cancelled` entry here.
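To make the proposal concrete, here is a sketch of what the per-task-type frequency object could look like with a `Cancelled` field added. The field name, placement, and `ResultFrequency` type are illustrative assumptions, not the actual task manager schema.

```typescript
// Hypothetical shape of the per-task-type result frequency object if a
// "Cancelled" bucket were added; not the real task manager types.
interface ResultFrequency {
  Success: number;
  RetryScheduled: number;
  Failed: number;
  Cancelled: number; // proposed addition
  status: string;
}

// Example: a task type whose runs were cancelled 20% of the time would
// no longer report Success: 100.
const example: ResultFrequency = {
  Success: 80,
  RetryScheduled: 0,
  Failed: 0,
  Cancelled: 20,
  status: 'OK',
};
```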
++ on making a `cancel` method required in the task definition

+1.
It seems too easy for a task to overrun its timeout, at which point we call `cancel` if it has been implemented, after which the task can be executed again. If `cancel` doesn't do anything (or isn't implemented), then that task can continue to run even after the timeout.
This means that if the next scheduled execution is reached and the task runs again, then you can end up with more than one instance of that task running in parallel.
This seems too brittle considering the complexity of the tasks we're now writing (async search, reporting, etc.).
We should consider making this required and adding some kind of requirement/guideline as to what `cancel` should do, which we can then validate either via review or, ideally, via automated testing of some kind. 🤔
cc @lizozom for async search, @tsullivan for reporting
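As a sketch of what a "required `cancel()` that actually stops the work" guideline could mean in practice: the runner below wires `cancel()` to an `AbortSignal` that the task body checks between units of work, so a run that overshoots its timeout exits promptly instead of overlapping the next one. This is not the real task manager API; the class and helper names are hypothetical.

```typescript
// Hypothetical cancellable task runner (not the task manager API): the
// required cancel() flips an AbortSignal, and run() checks that signal
// between units of work so it can bail out early.
const sleep = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

class CancellableTaskRunner {
  private readonly controller = new AbortController();

  get cancelled(): boolean {
    return this.controller.signal.aborted;
  }

  // The task body does its work in small units, checking for
  // cancellation between each one.
  async run(): Promise<'completed' | 'cancelled'> {
    for (let i = 0; i < 10; i++) {
      if (this.cancelled) return 'cancelled'; // exit at the next check
      await sleep(10); // one unit of "work"
    }
    return 'completed';
  }

  // Required cancel(): requests that run() stop at its next check.
  cancel(): void {
    this.controller.abort();
  }
}
```

The guideline this sketch implies: `cancel()` cannot forcibly stop a function, so a task body must be written to check for cancellation cooperatively.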
Task manager allows tasks to "cancel" themselves, if they run over their timeout, by calling a `cancel()` method in their runner. Tasks do not have to implement this, though.

We're seeing a situation with a customer that involves a task manager task which looks like it has overlapping executions - in theory they should not be overlapping.

The task invocations are overlapping because the task exceeds its timeout but does not implement the `cancel()` method. Whether the task implements the method or not, task manager makes the task schedule-able again. And because task manager can't actually STOP the task from running (it's just a function invocation, which you don't have control over like you do in Java with `thread.kill()` or whatever), the task keeps running. And so, overlapping task executions.

I repro'd this by hacking / instrumenting `x-pack/plugins/alerting/server/health/task.ts` as so:
hacked/instrumented alerting health task, changes marked with `/*!*/`
```ts
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */

import { CoreStart, Logger } from 'kibana/server';
import {
  RunContext,
  TaskManagerSetupContract,
  TaskManagerStartContract,
} from '../../../task_manager/server';
import { AlertsConfig } from '../config';
import { AlertingPluginsStart } from '../plugin';
import { HealthStatus } from '../types';
import { getHealth } from './get_health';

export const HEALTH_TASK_TYPE = 'alerting_health_check';
export const HEALTH_TASK_ID = `Alerting-${HEALTH_TASK_TYPE}`;

export function initializeAlertingHealth(
  logger: Logger,
  taskManager: TaskManagerSetupContract,
  coreStartServices: Promise<[CoreStart, AlertingPluginsStart, unknown]>
) {
  registerAlertingHealthCheckTask(logger, taskManager, coreStartServices);
}

export async function scheduleAlertingHealthCheck(
  logger: Logger,
  config: Promise
```

This changes the task so the interval is `5s`, the timeout is `2s`, and the task will run for at least `10s`.

You can see the overlap in the log messages:
```
1 - 2021-03-31T19:02:23.119Z - starting healthCheck task
1 - 2021-03-31T19:02:23.136Z - ending healthCheck task: success
1 - 2021-03-31T19:02:23.136Z - but waiting another 10 seconds anyway
server log [15:02:25.897] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:25.008Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:25.898Z - cancel() called
2 - 2021-03-31T19:02:29.123Z - starting healthCheck task
2 - 2021-03-31T19:02:29.135Z - ending healthCheck task: success
2 - 2021-03-31T19:02:29.135Z - but waiting another 10 seconds anyway
server log [15:02:31.122] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:31.025Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:31.122Z - cancel() called
1 - 2021-03-31T19:02:33.129Z - ending healthCheck task for reals
3 - 2021-03-31T19:02:34.117Z - starting healthCheck task
3 - 2021-03-31T19:02:34.126Z - ending healthCheck task: success
3 - 2021-03-31T19:02:34.126Z - but waiting another 10 seconds anyway
server log [15:02:36.019] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:36.017Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:36.019Z - cancel() called
2 - 2021-03-31T19:02:39.129Z - ending healthCheck task for reals
4 - 2021-03-31T19:02:40.127Z - starting healthCheck task
4 - 2021-03-31T19:02:40.137Z - ending healthCheck task: success
4 - 2021-03-31T19:02:40.137Z - but waiting another 10 seconds anyway
server log [15:02:42.905] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:42.029Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:42.905Z - cancel() called
3 - 2021-03-31T19:02:44.126Z - ending healthCheck task for reals
5 - 2021-03-31T19:02:46.136Z - starting healthCheck task
5 - 2021-03-31T19:02:46.147Z - ending healthCheck task: success
5 - 2021-03-31T19:02:46.147Z - but waiting another 10 seconds anyway
server log [15:02:48.906] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:48.036Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:48.907Z - cancel() called
4 - 2021-03-31T19:02:50.137Z - ending healthCheck task for reals
6 - 2021-03-31T19:02:52.116Z - starting healthCheck task
6 - 2021-03-31T19:02:52.127Z - ending healthCheck task: success
6 - 2021-03-31T19:02:52.127Z - but waiting another 10 seconds anyway
server log [15:02:54.914] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:54.018Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:54.914Z - cancel() called
5 - 2021-03-31T19:02:56.148Z - ending healthCheck task for reals
```

It seems to stabilize pretty quickly, never running more than 3 of these tasks at a time, after an hour or so running. Not sure whether it might slowly grow, or whether other combinations of timeouts, intervals, and actual execution times would end up with a non-constant number of overlapping executions.
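The overlap in the log above comes down to a basic property of promises: settling a timeout race does not stop the underlying work. A minimal sketch, with illustrative names and much shorter durations than the repro's 2s/10s:

```typescript
// Sketch of why a timed-out task keeps running: Promise.race settles at
// the timeout, but the losing work() promise is not stopped and keeps
// executing to completion. Names and durations are illustrative.
const sleep = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

let workFinished = false;

async function work(): Promise<string> {
  await sleep(300); // stand-in for the task's real 10s body
  workFinished = true;
  return 'done';
}

async function runWithTimeout(): Promise<string> {
  // The race settles at the 100ms "timeout", but work() runs on,
  // which is when overlapping executions become possible.
  return Promise.race([work(), sleep(100).then(() => 'timed out')]);
}
```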
what can we do?
One thing we could do is FORCE task implementors to implement a `cancel()` method, by making it a required method. We should obviously provide a nice sample for this. Also note: we need to implement `cancel()` on alerts and actions: issue https://github.com/elastic/kibana/issues/64148 (I suspect this won't end up being a "nice sample").

Another option would be to continue to allow non-cancellable tasks, but not reschedule them till the original async task actually completes - I think that could be possible. We might want to do this even if we require a `cancel()` call.

@mikecote pointed out this related PR: https://github.com/elastic/kibana/pull/83682
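The second option ("don't reschedule till the original async task actually completes") amounts to an in-flight guard around each recurring task. A minimal sketch, with hypothetical names, of how a scheduler tick could skip a run while the previous invocation is still settling:

```typescript
// Sketch of the "don't reschedule until the previous run settles" option
// (hypothetical names, not the task manager implementation): the runner
// tracks the in-flight promise and skips ticks while it is unsettled.
class NonOverlappingRunner {
  private inFlight: Promise<void> | null = null;
  public skipped = 0; // how many scheduled runs were suppressed

  // Called on every schedule tick with the task body to execute.
  tick(work: () => Promise<void>): void {
    if (this.inFlight) {
      this.skipped++; // previous run still going: do not overlap it
      return;
    }
    this.inFlight = work().finally(() => {
      this.inFlight = null; // only now is the task schedulable again
    });
  }
}
```

A trade-off of this design: a task that hangs forever would never run again, so it would likely still need to be paired with the timeout/cancel machinery.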