elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.61k stars 8.22k forks source link

[ResponseOps] Discrepancy between the rule state and task manager when performing bulkEnable or bulkDisable rules #192207

Open js-jankisalvi opened 2 months ago

js-jankisalvi commented 2 months ago

Related to: https://github.com/elastic/kibana/issues/181050

When enabling rules, the Alerting framework skips rules if their saved object has alert.attributes.enabled: true. This behavior creates issues described here and in the attached SDH.

There might be a situation where a rule is marked as enabled, but no corresponding task exists in the Task Manager. In the UI, these rules will appear enabled but will never run. Users expect that all rules affected by the bulk enable action will get a corresponding task created in the Task Manager and be scheduled for execution. Therefore, it would be best to also check if the rules to be enabled have tasks in the Task Manager, instead of relying solely on the rule's current enabled state.

elasticmachine commented 2 months ago

Pinging @elastic/response-ops (Team:ResponseOps)

pmuellr commented 2 months ago

From a comment in a previous attempted PR: https://github.com/elastic/kibana/pull/189041 ...

In order to make the double update (rule and task) more resilient, we determined that this should be the order of updates:

analysis Trying to capture what happens when a request to enable/disable a task is run, but one of the updates to the task doc or the rule doc fails - in this case, it's the second one (task updated then rule => task doc updated, then rule doc updated). Two tables for each of enable / disable, showing the difference in the ordering of the updates (rule then task, or task then rule). The first column is the input state of the task/doc, and the second column is the final state after the failed update. The mismatched input states would be the result of a bad enable/disable update, or some other bad thing that happened. Final state of "task: !enabled, rule: enabled" should be avoided if possible. The rule will look enabled in the UX, but actually will not be running. These final states have an `- XXX` appended to them. Final state of "task: enabled, rule: !enabled" is acceptable. In this case, the rule is the source of truth for enablement, so when the task runs, it will check if the rule is enabled. If it isn't, the rule execution code will instead disable the task, so it won't be polled for till re-enabled. Even if that update fails, we'll get it again the next time. #### enable: task updated then rule update fails | input state | rule update fails | |---------------------------------|---------------------------------| | task: enabled, rule: enabled | task: enabled, rule: enabled | | task: !enabled, rule: !enabled | task: enabled, rule: !enabled | | task: enabled, rule: !enabled | task: enabled, rule: !enabled | | task: !enabled, rule: enabled | task: enabled, rule: enabled | --- #### enable: rule updated then task update fails | input state | task update fails | |---------------------------------|---------------------------------| | task: enabled, rule: enabled | task: enabled, rule: enabled | | task: !enabled, rule: !enabled | task: !enabled, rule: enabled - XXX | | task: enabled, rule: !enabled | task: enabled, rule: enabled | | task: !enabled, rule: enabled | task: !enabled, rule: enabled - XXX | --- #### disable: task updated then rule update fails | input state | rule update fails | |---------------------------------|---------------------------------| | task: enabled, rule: enabled | task: !enabled, rule: enabled - XXX | | task: !enabled, rule: !enabled | task: !enabled, rule: !enabled | | task: enabled, rule: !enabled | task: !enabled, rule: !enabled | | task: !enabled, rule: enabled | task: !enabled, rule: enabled - XXX | --- disable: rule updated then task update fails | input state | rule update fails | |---------------------------------|---------------------------------| | task: enabled, rule: enabled | task: enabled, rule: !enabled | | task: !enabled, rule: !enabled | task: !enabled, rule: !enabled | | task: enabled, rule: !enabled | task: enabled, rule: !enabled | | task: !enabled, rule: enabled | task: !enabled, rule: !enabled |
pmuellr commented 2 months ago

In one of the previous related PRs, I believe I noticed we did NOT have a test where we created a rule in a disabled state, and then did a bulk update of it. We should add a FT for this case, and anything similar/related. You can repro this today, manually, by importing a rule that has been exported - it will be disabled on import (with no backing task doc). You can then enable it via the rule list, which will use the bulk enable/disable functionality.

pmuellr commented 1 month ago

Assuming this PR gets merged, it may change the way we fix this problem: Add bulk update function that directly updates using the esClient #191760.

Since this allows a partial update, what I'm wondering is if we just blast the enabled field in the task docs, as appropriate, without any OCC. Previously we were doing a full update (essentially). We were thinking we would need to OCC a bunch of the task doc updates, but not really sure that we do. We have conflict resolution in the task claimer, but I don't think there's really a need for it here, for the tasks anyway.