This issue resulted from a conversation with @imkarrer.
Problem
Long feedback loop
When there's an error with a lifecycle policy, the only way you're notified of it is by opening Index Management and seeing an error callout. The only way to review the indices that are encountering ILM errors is to filter on ilm.step:ERROR and then click each index to see information about its error.
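To make the "list all problems at once" goal concrete, here is a minimal sketch of what filtering on ilm.step:ERROR amounts to when done over a single Explain Lifecycle response rather than index by index. It assumes a response shaped like GET <index>/_ilm/explain (an indices map keyed by index name, with step, failed_step and step_info.reason per index). Only those fields are modeled, and the names IlmErrorSummary and listIlmErrors are hypothetical, not part of Kibana or Elasticsearch.

```typescript
// Assumed subset of the Explain Lifecycle API response (GET <index>/_ilm/explain).
// Only the fields used below are modeled; real responses carry more.
interface IlmExplainIndex {
  index: string;
  managed: boolean;
  policy?: string;
  step?: string; // "ERROR" while the index is halted on a failure
  failed_step?: string; // the step that failed
  step_info?: { type?: string; reason?: string };
}

interface IlmExplainResponse {
  indices: Record<string, IlmExplainIndex>;
}

// Hypothetical row for an "all ILM errors" table.
interface IlmErrorSummary {
  index: string;
  policy: string;
  failedStep: string;
  reason: string;
}

// Collect every managed index whose current step is ERROR, so all problems
// can be shown together instead of opening each index individually.
function listIlmErrors(explain: IlmExplainResponse): IlmErrorSummary[] {
  return Object.keys(explain.indices)
    .map((name) => explain.indices[name])
    .filter((info) => info.managed && info.step === "ERROR")
    .map((info) => ({
      index: info.index,
      policy: info.policy ?? "(unknown)",
      failedStep: info.failed_step ?? "(unknown)",
      reason: info.step_info?.reason ?? "(no reason reported)",
    }))
    .sort((a, b) => a.index.localeCompare(b.index));
}
```

Sorting by index name keeps rows in a stable order between refreshes, which matters if the table is re-polled after each attempted fix.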
How can we expose this information more immediately? The criteria for a good solution are:
The presence of a problem is immediately identifiable, for example on an Overview page.
Actions for remediating the problem are surfaced directly in the table, e.g. triggering rollover (https://github.com/elastic/kibana/issues/64082) or clicking a "Copy" button next to the index name, since remediation workflows commonly involve executing Index API requests in Console.
The ideal workflow consists of: seeing a list of all problems, trying a fix, and seeing the table update in response.
Unreliable feedback loop
The Explain Lifecycle API doesn't preserve the last known error state while it's re-attempting to apply a lifecycle change. This means that as lifecycles run, the Index Management table will intermittently show a number of indices with ILM errors, and then 0 indices with errors, and then a number of indices with errors again. This creates a literal moving target for an administrator attempting to fix these errors. The ideal workflow consists of: seeing a list of all problems, trying a fix, and seeing the table update in response.
One solution would be to update the Explain Lifecycle API to preserve the last known error state and surface that in the table instead. Each index would then have two independent kinds of ILM state: error state (error or no error) and running state (running or not running).
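The proposed two-dimensional state can be sketched as a small state fold over successive snapshots of an index. This is a hypothetical model of the proposal, not an existing API. It assumes that a retry moves the index back to the step that failed (as a manual ILM retry does), so seeing that same step again is treated as "retry in progress" and the last known error is kept. Reaching any other step clears it. The type and function names, and the step names in the test, are illustrative.

```typescript
// Hypothetical two-dimensional ILM state from the proposal; not an existing API.
type ErrorState = "error" | "no error";
type RunningState = "running" | "not running";

// One observation of an index, e.g. taken from an Explain Lifecycle poll.
interface IlmSnapshot {
  step: string; // current step, "ERROR" while halted on a failure
  failedStep?: string; // step that failed, when step === "ERROR"
  reason?: string; // error reason, when step === "ERROR"
}

interface TrackedIlmState {
  errorState: ErrorState;
  runningState: RunningState;
  lastError?: { failedStep: string; reason: string };
}

// Fold a new snapshot into the previous state so the last known error
// survives while ILM re-attempts the failed step.
function nextState(prev: TrackedIlmState | undefined, snap: IlmSnapshot): TrackedIlmState {
  if (snap.step === "ERROR") {
    return {
      errorState: "error",
      runningState: "not running",
      lastError: {
        failedStep: snap.failedStep ?? "(unknown)",
        reason: snap.reason ?? "(no reason reported)",
      },
    };
  }
  // Back on the step that failed: assume a retry is in progress and keep the error.
  if (prev !== undefined && prev.lastError !== undefined && snap.step === prev.lastError.failedStep) {
    return { errorState: "error", runningState: "running", lastError: prev.lastError };
  }
  // Moved on to a different step: treat the error as resolved.
  return { errorState: "no error", runningState: "running" };
}
```

With this model, the table would keep showing an index as errored while it is being retried, instead of flickering between error and no error, and would only clear the row once the index actually progresses past the failed step.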