elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[ResponseOps] zombie task - task doc exists with enabled: true, but rule enabled: false #152957

Open pmuellr opened 1 year ago

pmuellr commented 1 year ago

stack version: 8.5.3

Describe the bug:

Somehow, a rule got marked as disabled - and had its API key deleted - but the task document still existed. When the rule ran, it produced the message:

Error: Rule failed to execute because rule ran after it was disabled.
    at loadRule (/x-pack/plugins/alerting/server/task_runner/rule_loader.js:40:11)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at TaskRunner.prepareToRun (/x-pack/plugins/alerting/server/task_runner/task_runner.js:409:12)
    at TaskRunnerTimer.runWithTimer (/x-pack/plugins/alerting/server/task_runner/task_runner_timer.js:49:20)
    at TaskRunner.run (/x-pack/plugins/alerting/server/task_runner/task_runner.js:515:30)
    at TaskManagerRunner.run (/x-pack/plugins/task_manager/server/task_running/task_runner.js:266:22)

Looking at the task document, it has enabled: true, and the rule has enabled: false.

Seems we probably added this diagnostic for a case where a rule execution had started, but then the rule was disabled at some point during the execution, so we cancelled the execution.

Feels like we should have a different check, near the start of the execution, that looks at whether the rule is disabled and, if it is, sets the task.enabled field to false in the task document. The rule should "win" if the task and rule differ in enablement, since the user can change the rule, but not the task.

It could still happen that the rule is disabled WHILE it's running, and then the message produced is fine. This feels like it's slightly different, in that we'd check at the very beginning of the run, and for these specific conditions (task enabled, rule disabled).
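A minimal sketch of the start-of-run check proposed above, to make the "rule wins" idea concrete. Everything here is hypothetical: `RuleDoc`, `TaskDoc`, `InMemoryTaskStore`, and `reconcileEnablement` are illustrative names and simplified shapes, not Kibana's actual alerting or task manager APIs, and the real change might look quite different.

```typescript
// Hypothetical, simplified document shapes (assumptions, not Kibana's real types).
interface RuleDoc {
  id: string;
  enabled: boolean;
}

interface TaskDoc {
  id: string;
  enabled: boolean;
}

type PreflightResult = "run" | "skip-and-disable-task";

// Hypothetical in-memory stand-in for wherever task documents are persisted.
class InMemoryTaskStore {
  private tasks = new Map<string, TaskDoc>();

  put(task: TaskDoc): void {
    this.tasks.set(task.id, { ...task });
  }

  get(id: string): TaskDoc | undefined {
    const task = this.tasks.get(id);
    return task ? { ...task } : undefined;
  }

  setEnabled(id: string, enabled: boolean): void {
    const task = this.tasks.get(id);
    if (task) {
      task.enabled = enabled;
    }
  }
}

// Runs at the very beginning of an execution. If the task is enabled but the
// rule is disabled, the rule "wins": the task document is updated to match
// and the run is skipped. All other combinations proceed as before.
function reconcileEnablement(
  rule: RuleDoc,
  task: TaskDoc,
  store: InMemoryTaskStore
): PreflightResult {
  if (task.enabled && !rule.enabled) {
    store.setEnabled(task.id, false);
    return "skip-and-disable-task";
  }
  return "run";
}

// Usage: a zombie task (enabled) whose rule has been disabled.
const demoStore = new InMemoryTaskStore();
demoStore.put({ id: "task-1", enabled: true });
const demoResult = reconcileEnablement(
  { id: "rule-1", enabled: false },
  demoStore.get("task-1")!,
  demoStore
);
console.log(demoResult, demoStore.get("task-1"));
```

This check only covers the specific mismatch (task enabled, rule disabled) at the start of a run. A rule disabled while it is already running would still hit the existing "ran after it was disabled" path.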

elasticmachine commented 1 year ago

Pinging @elastic/response-ops (Team:ResponseOps)

IanLee1521 commented 4 months ago

I just bumped into this myself in my 8.13.2 instance. The following was the Kibana UI alert message:

metrics.alert.threshold:6d774450-ad5d-11ed-838e-91a7b3d66137: execution failed - Rule failed to execute because rule ran after it was disabled.

Is there even a temporary / manual way to recover from this situation? Does one need to manually delete the task or something?

Matthew-Jenkins commented 1 month ago

I'm having this issue after upgrading to 8.13.4. Deleting the rule and reimporting the saved object doesn't fix it.

Matthew-Jenkins commented 1 month ago

Workaround I found: delete the rule, then wait (however long it takes) for the task to run again and die because the rule has been deleted. Then reimport the saved object. Now you can enable it.