elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Task manager][discuss] force implementation of cancel()? Or don't reschedule non-cancellable tasks until complete #95985

pmuellr commented 3 years ago

Task manager allows tasks to be "cancelled" if they run over their timeout: when the timeout expires, task manager calls the cancel() method on the task's runner. Tasks do not have to implement this method, though.

We're seeing a situation with a customer that involves a task manager task which appears to have overlapping executions - in theory they should never overlap.

The task invocations are overlapping because the task exceeds its timeout, but does not implement the cancel() method. Whether the task implements the method or not, task manager makes the task schedule-able again once the timeout expires. And because task manager can't actually STOP the task from running (it's just a function invocation, which you don't have control over the way you do in Java with thread.kill() or whatever), the task keeps running. And so: overlapping task executions.

I repro'd this by hacking / instrumenting x-pack/plugins/alerting/server/health/task.ts as follows:

hacked/instrumented alerting health task, changes marked with `/*!*/`:

```ts
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */

import { CoreStart, Logger } from 'kibana/server';
import {
  RunContext,
  TaskManagerSetupContract,
  TaskManagerStartContract,
} from '../../../task_manager/server';
import { AlertsConfig } from '../config';
import { AlertingPluginsStart } from '../plugin';
import { HealthStatus } from '../types';
import { getHealth } from './get_health';

export const HEALTH_TASK_TYPE = 'alerting_health_check';

export const HEALTH_TASK_ID = `Alerting-${HEALTH_TASK_TYPE}`;

export function initializeAlertingHealth(
  logger: Logger,
  taskManager: TaskManagerSetupContract,
  coreStartServices: Promise<[CoreStart, AlertingPluginsStart, unknown]>
) {
  registerAlertingHealthCheckTask(logger, taskManager, coreStartServices);
}

export async function scheduleAlertingHealthCheck(
  logger: Logger,
  config: Promise<AlertsConfig>,
  taskManager: TaskManagerStartContract
) {
  try {
    const interval = (await config).healthCheck.interval;
    await taskManager.ensureScheduled({
      id: HEALTH_TASK_ID,
      taskType: HEALTH_TASK_TYPE,
      schedule: {
/*!*/   interval: '5s',
      },
      state: {},
      params: {},
    });
  } catch (e) {
    logger.debug(`Error scheduling task, received ${e.message}`);
  }
}

function registerAlertingHealthCheckTask(
  logger: Logger,
  taskManager: TaskManagerSetupContract,
  coreStartServices: Promise<[CoreStart, AlertingPluginsStart, unknown]>
) {
  taskManager.registerTaskDefinitions({
    [HEALTH_TASK_TYPE]: {
      title: 'Alerting framework health check task',
      createTaskRunner: healthCheckTaskRunner(logger, coreStartServices),
/*!*/ timeout: '2s',
    },
  });
}

export function healthCheckTaskRunner(
  logger: Logger,
  coreStartServices: Promise<[CoreStart, AlertingPluginsStart, unknown]>
) {
/*!*/ let taskRunCount: number = 0;
  return ({ taskInstance }: RunContext) => {
    const { state } = taskInstance;
    return {
/*!*/ async cancel() {
/*!*/   console.log(`${new Date().toISOString()} - cancel() called`)
/*!*/ },
      async run() {
        taskRunCount++;
        const localTaskRunCount = taskRunCount;
/*!*/   console.log(`${localTaskRunCount} - ${new Date().toISOString()} - starting healthCheck task`)
        try {
          const alertingHealthStatus = await getHealth(
            (await coreStartServices)[0].savedObjects.createInternalRepository(['alert'])
          );
/*!*/     console.log(`${localTaskRunCount} - ${new Date().toISOString()} - ending healthChecktask: success`)
/*!*/     console.log(`${localTaskRunCount} - ${new Date().toISOString()} - but waiting another 10 seconds anyway`)
/*!*/     await new Promise((resolve) => setTimeout(resolve, 10 * 1000));
/*!*/     console.log(`${localTaskRunCount} - ${new Date().toISOString()} - ending healthChecktask for reals`)
          return {
            state: {
              runs: (state.runs || 0) + 1,
              health_status: alertingHealthStatus.decryptionHealth.status,
            },
          };
        } catch (errMsg) {
          logger.warn(`Error executing alerting health check task: ${errMsg}`);
/*!*/     console.log(`${localTaskRunCount} - ending healthCheck task: error`)
          return {
            state: {
              runs: (state.runs || 0) + 1,
              health_status: HealthStatus.Error,
            },
          };
        }
      },
    };
  };
}
```

This changes the task so the interval is 5s, the timeout is 2s, and the task will run for at least 10s.

You can see the overlap in the log messages:

```
1 - 2021-03-31T19:02:23.119Z - starting healthCheck task
1 - 2021-03-31T19:02:23.136Z - ending healthChecktask: success
1 - 2021-03-31T19:02:23.136Z - but waiting another 10 seconds anyway
server log [15:02:25.897] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:25.008Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:25.898Z - cancel() called
2 - 2021-03-31T19:02:29.123Z - starting healthCheck task
2 - 2021-03-31T19:02:29.135Z - ending healthChecktask: success
2 - 2021-03-31T19:02:29.135Z - but waiting another 10 seconds anyway
server log [15:02:31.122] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:31.025Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:31.122Z - cancel() called
1 - 2021-03-31T19:02:33.129Z - ending healthChecktask for reals
3 - 2021-03-31T19:02:34.117Z - starting healthCheck task
3 - 2021-03-31T19:02:34.126Z - ending healthChecktask: success
3 - 2021-03-31T19:02:34.126Z - but waiting another 10 seconds anyway
server log [15:02:36.019] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:36.017Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:36.019Z - cancel() called
2 - 2021-03-31T19:02:39.129Z - ending healthChecktask for reals
4 - 2021-03-31T19:02:40.127Z - starting healthCheck task
4 - 2021-03-31T19:02:40.137Z - ending healthChecktask: success
4 - 2021-03-31T19:02:40.137Z - but waiting another 10 seconds anyway
server log [15:02:42.905] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:42.029Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:42.905Z - cancel() called
3 - 2021-03-31T19:02:44.126Z - ending healthChecktask for reals
5 - 2021-03-31T19:02:46.136Z - starting healthCheck task
5 - 2021-03-31T19:02:46.147Z - ending healthChecktask: success
5 - 2021-03-31T19:02:46.147Z - but waiting another 10 seconds anyway
server log [15:02:48.906] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:48.036Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:48.907Z - cancel() called
4 - 2021-03-31T19:02:50.137Z - ending healthChecktask for reals
6 - 2021-03-31T19:02:52.116Z - starting healthCheck task
6 - 2021-03-31T19:02:52.127Z - ending healthChecktask: success
6 - 2021-03-31T19:02:52.127Z - but waiting another 10 seconds anyway
server log [15:02:54.914] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:54.018Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:54.914Z - cancel() called
5 - 2021-03-31T19:02:56.148Z - ending healthChecktask for reals
```

It seems to stabilize pretty quickly - after an hour or so of running, there were never more than 3 of these tasks running at a time. Not sure whether it might slowly grow, or whether other combinations of timeouts, intervals, and actual execution times would end up with a non-constant number of overlapping executions.

What can we do?

One thing we could do is FORCE task implementors to implement a cancel() method, by making it required. We should obviously provide a nice sample of this. Also note: we need to implement cancel() on alerts and actions - issue https://github.com/elastic/kibana/issues/64148 (I suspect that one won't end up being a "nice sample").
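To make the "nice sample" idea concrete, here's a rough sketch of what a cooperatively cancellable runner could look like, using the same createTaskRunner shape as the health check task above. This is not the real alerts/actions implementation: doWorkChunk() and lastRunCancelled are made up for illustration, and a real cancel() would also need to abort whatever the task is actually awaiting (ES searches, etc.).

```ts
import { RunContext } from '../../../task_manager/server';

// Hypothetical unit of work, standing in for whatever the task really does.
async function doWorkChunk(chunk: number): Promise<void> {
  /* ... */
}

export function cancellableTaskRunner() {
  return ({ taskInstance }: RunContext) => {
    const { state } = taskInstance;
    let cancelled = false;

    return {
      async cancel() {
        // Cooperative cancellation: flip a flag that run() checks between steps.
        cancelled = true;
      },

      async run() {
        for (let chunk = 0; chunk < 10; chunk++) {
          if (cancelled) {
            // Stop doing work as soon as task manager asks us to.
            return { state: { ...state, lastRunCancelled: true } };
          }
          await doWorkChunk(chunk);
        }

        return { state: { ...state, lastRunCancelled: false } };
      },
    };
  };
}
```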

Another option would be to continue to allow non-cancellable tasks, but not reschedule them until the original async task actually completes - I think that could be possible. We might want to do this even if we require a cancel() method.
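For that second option, the bookkeeping might look roughly like this - just a sketch of the idea, not the actual task manager claiming/polling code; inFlight, taskId, and runTask() are all illustrative names:

```ts
// Sketch: remember the in-flight run() promise per task id, and skip the next
// scheduled execution of that task until the previous promise has settled,
// even if the previous run already blew past its timeout.
const inFlight = new Map<string, Promise<unknown>>();

async function maybeRunTask(taskId: string, runTask: () => Promise<unknown>) {
  if (inFlight.has(taskId)) {
    // The previous execution hasn't completed yet; don't start an overlapping
    // one. The task stays scheduled and gets picked up on a later poll.
    return;
  }

  const promise = runTask().finally(() => inFlight.delete(taskId));
  inFlight.set(taskId, promise);
  await promise;
}
```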

@mikecote pointed out this related PR: https://github.com/elastic/kibana/pull/83682

elasticmachine commented 3 years ago

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

pmuellr commented 3 years ago

Realizing there may be some other related issues here, so I'm just going to append them to this one; we can split them out into separate issues later if needed.

issue: task manager health status API does not indicate cancelled tasks

There's a place for it, I think, under stats.runtime.value.execution.result_frequency_percent_as_number, where you see the following object per task type:

```
{
  "Success": 100,
  "RetryScheduled": 0,
  "Failed": 0,
  "status": "OK"
}
```

For the tasks that were cancelled, I can still see execution times in the health stats that indicate they were very likely cancelled (the execution time just exceeds the timeout), but the frequency numbers in the object still show Success: 100. Seems like we could add Cancelled here.
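For example, the per-task-type object could grow a Cancelled bucket - hypothetical shape and numbers, shown only to illustrate the idea:

```
{
  "Success": 90,
  "RetryScheduled": 0,
  "Failed": 0,
  "Cancelled": 10,
  "status": "OK"
}
```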

tsullivan commented 3 years ago

++ on making a cancel method required in the task definition

gmmorris commented 3 years ago

+1.

It seems too easy for a task to overrun its timeout, at which point we call cancel() if it has been implemented, after which the task can be scheduled to run again. If cancel() doesn't do anything (or isn't implemented), then the task can continue to run even after the timeout. This means that if the next scheduled execution is reached and the task runs again, you can end up with more than one instance of that task running in parallel.

This seems too brittle considering the complexity of the tasks we're now writing (async search, reporting, etc.).

We should consider making this required and adding some kind of requirement/guideline as to what cancel() should do, which we can then validate either via review or, ideally, via automated testing of some kind. 🤔

cc @lizozom for async search, @tsullivan for reporting