Improve ILM error reporting in Index Management UI

cjcenizal commented 2 years ago

This issue resulted from a conversation with @imkarrer.

Problem

Long feedback loop

When there's an error with a lifecycle policy, the only way you're notified is by opening Index Management and seeing an error callout. The only way to review the indices that are encountering ILM errors is to filter on ilm.step:ERROR and then click each index to see the information about the error.

How can we expose this information more immediately? The criteria for a good solution are:

The presence of a problem is immediately identifiable, for example on an Overview page.
The nature of the problem is surfaced directly in the table, e.g. the current phase of the ILM policy (https://github.com/elastic/kibana/issues/61119) and details about the error.
Actions for remediating the problem are surfaced directly in the table, e.g. triggering rollover (https://github.com/elastic/kibana/issues/64082) or clicking a "Copy" button next to the index name since remediation workflows commonly involve executing Index API requests in Console.
The ideal workflow consists of: seeing a list of all problems, trying a fix, and seeing the table update in response.
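
To make the "list of all problems" idea concrete, here is a rough sketch of flattening an Explain Lifecycle response into table rows, one per index in the ERROR step. The `IlmExplainEntry` shape is a simplified assumption based on the fields the API reports (`step`, `failed_step`, `step_info`), and `collectIlmErrors` and `IlmErrorRow` are hypothetical names, not existing Kibana code.

```typescript
// Simplified, assumed shape of one entry under `indices` in an
// Explain Lifecycle response. Only the fields used below are listed.
interface IlmExplainEntry {
  index: string;
  managed: boolean;
  policy?: string;
  phase?: string;
  step?: string;
  failed_step?: string;
  step_info?: { type?: string; reason?: string };
}

// Hypothetical row shape for an "ILM errors" table.
interface IlmErrorRow {
  index: string;
  policy: string;
  phase: string;
  failedStep: string;
  reason: string;
}

// Keep only managed indices whose current step is ERROR and surface the
// phase, failed step, and error reason directly, so nobody has to click
// into each index to find them.
function collectIlmErrors(indices: Record<string, IlmExplainEntry>): IlmErrorRow[] {
  return Object.values(indices)
    .filter((entry) => entry.managed && entry.step === 'ERROR')
    .map((entry) => ({
      index: entry.index,
      policy: entry.policy ?? 'unknown',
      phase: entry.phase ?? 'unknown',
      failedStep: entry.failed_step ?? 'unknown',
      reason: entry.step_info?.reason ?? 'no reason reported',
    }))
    .sort((a, b) => a.index.localeCompare(b.index));
}
```

The input would typically be the `indices` object from the Explain Lifecycle API response. Each row carries enough detail to show the problem and to copy the index name for a follow-up request in Console.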

Unreliable feedback loop

The Explain Lifecycle API doesn't preserve the last known error state while it's re-attempting to apply a lifecycle change. This means that as lifecycles run, the Index Management table will intermittently show a number of indices with ILM errors, and then 0 indices with errors, and then a number of indices with errors again. This creates a literal moving target for an administrator attempting to fix these errors. The ideal workflow consists of: seeing a list of all problems, trying a fix, and seeing the table update in response.

One solution could consist of updating the Explain Lifecycle API to preserve the last known error state, and to surface that in the table instead. This could result in each index having two types of ILM state: Error state (error and no error) and Running state (running and not running).
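
As a rough illustration of the two-state idea, the sketch below tracks Error state and Running state separately and keeps the last known error while ILM retries the failed step. Everything here is hypothetical (`TrackedIlmState`, `IlmSnapshot`, `nextIlmState`), and the rule for clearing an error (the index moves on to a step other than the one that failed) is an assumption for the sake of the example, not how the Explain Lifecycle API behaves today.

```typescript
type ErrorState = 'error' | 'no_error';
type RunningState = 'running' | 'not_running';

// Hypothetical per-index state that the UI (or the API) would keep.
interface TrackedIlmState {
  errorState: ErrorState;
  runningState: RunningState;
  failedStep?: string;
  lastError?: string;
}

// Hypothetical, minimal view of one index in a fresh explain result.
interface IlmSnapshot {
  step?: string;
  failedStep?: string;
  reason?: string;
}

function nextIlmState(
  previous: TrackedIlmState | undefined,
  snapshot: IlmSnapshot
): TrackedIlmState {
  // The index is currently parked in the ERROR step: record the details.
  if (snapshot.step === 'ERROR') {
    return {
      errorState: 'error',
      runningState: 'not_running',
      failedStep: snapshot.failedStep ?? previous?.failedStep,
      lastError: snapshot.reason ?? previous?.lastError,
    };
  }
  // ILM is re-attempting the step that failed: keep the last known error
  // visible, but mark the index as running again.
  if (
    previous !== undefined &&
    previous.errorState === 'error' &&
    previous.failedStep !== undefined &&
    snapshot.step === previous.failedStep
  ) {
    return { ...previous, runningState: 'running' };
  }
  // Assumption: reaching any other step means the earlier error is resolved.
  return { errorState: 'no_error', runningState: 'running' };
}
```

With this split, an index that is retrying a failed step would still count toward the number of indices with ILM errors, so the total would not drop to 0 between attempts.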

elasticmachine commented 2 years ago

Pinging @elastic/platform-deployment-management (Team:Deployment Management)

elasticmachine commented 2 weeks ago

Pinging @elastic/kibana-management (Team:Kibana Management)
