elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Meta] [SLO] Health status management for SLOs #184971

Open jasonrhodes opened 1 month ago

jasonrhodes commented 1 month ago

Following up with the work done in https://github.com/elastic/kibana/pull/181351 to provide health information for SLOs that are in an unhealthy state, we should continue to evaluate how to improve this scenario for users.

Goals

Primarily, a user's first view of the "unhealthy" indicator shouldn't mention the underlying "transform" primitive at all. Users who don't know anything about the Elastic stack won't know what a transform is, so a message saying, "The following transform is in an unhealthy state" will be useless for that type of user.

Second, we don't want to completely block access to this transform information for a user who does manage their own stack, so we should try to provide a way to click through for more advanced error information where we can provide the transform ID, and possibly a link to the UI where they may be able to manage this transform, etc.

To this end, we are considering the following tasks:

### Tasks
- [ ] Update the initial "unhealthy" banners to not mention transforms, and replace that information with a button to "reset this SLO configuration" (wording TBD) — this may need some investigation as to when exactly we can suggest this remediation (for some error messages and not others, perhaps?) and whether we can/should trigger recreating these transforms from the Kibana UI
- [ ] Once we understand which scenarios are good candidates to suggest a reset of the SLO (recreating the underlying transforms), we should then consider whether it makes more sense to just auto-reset them in some or all of these scenarios.
- [ ] Allow drilling into the health status indicator for more information where we can provide the transform ID, underlying error messages, etc.
- [ ] If a transform is in an unhealthy state and we aren't going to auto-reset it for the customer, in the advanced error information section we should explore whether we can link the user to a UI in Kibana where they can further explore the transform's health, possibly resetting it from an ML transforms UI
- [ ] https://github.com/elastic/kibana/issues/178853
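To illustrate the first and third tasks, here is a minimal sketch (a hypothetical helper, not existing Kibana code) of how the raw transform health payload could be split into a primary banner message that never mentions transforms, plus an "advanced" section carrying the transform ID and raw issues for a click-through view:

```typescript
// Shapes loosely based on the transform _stats "health" object.
interface TransformHealthIssue {
  type: string;
  issue: string;
  details?: string;
}

interface TransformHealth {
  status: "green" | "yellow" | "red" | "unknown";
  issues?: TransformHealthIssue[];
}

interface SloHealthSummary {
  healthy: boolean;
  // Primary banner text: deliberately avoids the word "transform".
  userMessage: string;
  // Advanced details (transform id + raw issues) behind a click-through.
  advanced?: { transformId: string; issues: TransformHealthIssue[] };
}

function summarizeSloHealth(
  transformId: string,
  health: TransformHealth
): SloHealthSummary {
  if (health.status === "green") {
    return { healthy: true, userMessage: "This SLO is healthy." };
  }
  return {
    healthy: false,
    userMessage:
      "This SLO is not processing data correctly. " +
      "Click for advanced details or try resetting the SLO.",
    advanced: { transformId, issues: health.issues ?? [] },
  };
}
```

The wording here is placeholder only ("wording TBD" per the task list); the point is the separation of the user-facing message from the primitive-level details.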
elasticmachine commented 1 month ago

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

kdelemme commented 4 weeks ago

I've spent some time thinking about this, and so far I can't find a single SLO issue that can be solved by just resetting it. Most transform errors come from badly shaped source data, e.g. a timestamp that can't be parsed for some reason, which blocks the transform from progressing, or from an SLO query (and therefore a transform query) that is badly formatted.

Here's an example of such a transform failing to parse dates correctly (from overview QA; screenshot omitted):

`GET _transform/id/_stats` response:

```json
{
  "count": 1,
  "transforms": [
    {
      "id": "slo-1b986e40-40ef-11ee-bf0a-bf025fb10c44-3",
      "state": "failed",
      "reason": """Failed to index documents into destination index due to permanent error: [org.elasticsearch.xpack.transform.transforms.BulkIndexingException: Bulk index experienced [500] failures and at least 1 irrecoverable [unable to parse date [1702842480000]]. Other failures: [IngestProcessorException] message [org.elasticsearch.ingest.IngestProcessorException: java.lang.IllegalArgumentException: unable to parse date [1702842480000]]; java.lang.IllegalArgumentException: unable to parse date [1702842480000]]""",
      "node": {
        "id": "y8gOqOFBQpm4-S0n8vXXVA",
        "name": "instance-0000000074",
        "ephemeral_id": "RVfSa3VYT922kqn99Ps6AQ",
        "transport_address": "172.23.178.115:19308",
        "attributes": {}
      },
      "stats": {
        "pages_processed": 0,
        "documents_processed": 0,
        "documents_indexed": 0,
        "documents_deleted": 0,
        "trigger_count": 0,
        "index_time_in_ms": 0,
        "index_total": 0,
        "index_failures": 0,
        "search_time_in_ms": 0,
        "search_total": 0,
        "search_failures": 0,
        "processing_time_in_ms": 0,
        "processing_total": 0,
        "delete_time_in_ms": 0,
        "exponential_avg_checkpoint_duration_ms": 0,
        "exponential_avg_documents_indexed": 0,
        "exponential_avg_documents_processed": 0
      },
      "checkpointing": {
        "last": {
          "checkpoint": 0
        },
        "next": {
          "checkpoint": 1,
          "checkpoint_progress": {
            "docs_indexed": 0,
            "docs_processed": 0
          },
          "timestamp_millis": 1705486447618,
          "time_upper_bound_millis": 1705486380000
        },
        "operations_behind": 32002121
      },
      "health": {
        "status": "red",
        "issues": [
          {
            "type": "transform_task_failed",
            "issue": "Transform task state is [failed]",
            "details": """Failed to index documents into destination index due to permanent error: [org.elasticsearch.xpack.transform.transforms.BulkIndexingException: Bulk index experienced [500] failures and at least 1 irrecoverable [unable to parse date [1702842480000]]. Other failures: [IngestProcessorException] message [org.elasticsearch.ingest.IngestProcessorException: java.lang.IllegalArgumentException: unable to parse date [1702842480000]]; java.lang.IllegalArgumentException: unable to parse date [1702842480000]]""",
            "count": 1
          }
        ]
      }
    }
  ]
}
```

I believe it would be useful to surface the last few transform `issues.details` entries on the health callout.
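As a sketch of that idea, a small (hypothetical, not existing Kibana code) helper could pull the most recent `health.issues[].details` strings out of the transform `_stats` payload shown above, falling back to the short `issue` text when `details` is absent:

```typescript
// Minimal shapes matching the relevant parts of the _stats response above.
interface StatsIssue {
  type: string;
  issue: string;
  details?: string;
  count?: number;
}

interface TransformStats {
  id: string;
  health?: { status: string; issues?: StatsIssue[] };
}

// Return up to `max` of the most recent issue detail strings for display
// on the SLO health callout.
function lastIssueDetails(transform: TransformStats, max = 3): string[] {
  const issues = transform.health?.issues ?? [];
  return issues
    .map((i) => i.details ?? i.issue) // prefer details, fall back to issue
    .slice(-max); // keep only the last few entries
}
```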

jasonrhodes commented 3 weeks ago

> for now I can't find one issue with SLOs that can be solved by just resetting it

Gotcha, so we may need to pivot this work away from the reset fix and find another remediation that avoids mentioning transforms wherever possible. For example, if the transform is failing because the source data is malformed, we should frame the problem as "SLO source data is malformed" and provide guidance on how to fix it wherever we can. We may be able to sync up with the Logs+ effort (cc: @ruflin / @flash1293) to see whether we can connect into their "structure all logs" flow to guide users through fixing malformed documents.

Also interested in hearing other thoughts and ideas on how to surface the issue to the SLO user without mentioning the transform primitive directly.

jasonrhodes commented 3 weeks ago

I wonder if either the transform ES APIs or the SLO APIs could provide some kind of validation on whether an index is valid for use inside of an SLO transform? This might be related to "dataset quality" as well (cc @flash1293)
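One way such a validation could work (a sketch only, built on the shape of the Elasticsearch field capabilities API response, not an existing SLO API): before creating the SLO, check that the configured timestamp field exists and is mapped as a date type across all backing indices, which would catch parse failures like the one above up front:

```typescript
// Simplified shape of a field_caps API response: field name -> type name ->
// capabilities for that type.
type FieldCapsResponse = {
  fields: Record<string, Record<string, { type: string; searchable: boolean }>>;
};

// Pre-flight check: the timestamp field must exist and be a date type
// (date or date_nanos) in every index it appears in.
function validateTimestampField(
  caps: FieldCapsResponse,
  timestampField: string
): { valid: boolean; reason?: string } {
  const capsForField = caps.fields[timestampField];
  if (!capsForField) {
    return { valid: false, reason: `Field [${timestampField}] not found` };
  }
  const types = Object.keys(capsForField);
  const nonDate = types.filter((t) => t !== "date" && t !== "date_nanos");
  if (nonDate.length > 0) {
    return {
      valid: false,
      reason: `Field [${timestampField}] is mapped as [${nonDate.join(", ")}] in some indices`,
    };
  }
  return { valid: true };
}
```

A real implementation would also need to validate the SLO query itself, since malformed queries are the other failure class mentioned above.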

flash1293 commented 3 weeks ago

Thanks for the ping @jasonrhodes - this does indeed sound related to the dataset quality effort.

Some thoughts/questions:

Could you give a few concrete examples of what causes transforms to fail? Do we know the most common cases? If so, we could see how this fits into the ongoing Logs+ effort.

It seems a bit like "malformed documents" in this case means "malformed relative to the SLO configuration", which might be harder to handle on a general dataset quality page, but I'd like to better understand the kinds of issues we're dealing with here.

kdelemme commented 3 weeks ago

For the particular error shown above, the failure comes from the source index: the transform can't parse a field in the search results, in this case the timestamp field. But I believe this is the only error that originates in the source data. The other errors are more transform-related (probably shard or cluster issues), like these:

```
Failed to execute phase [can_match], start; org.elasticsearch.action.search.SearchPhaseExecutionException: Search rejected due to missing shards [[.ds-metrics-apm.internal-default-2024.06.08-000030][1], [.ds-metrics-apm.service_transaction.1m-default-2024.06.07-000023][1], [.ds-metrics-apm.transaction.1m-default-2024.06.07-000024][1]]. Consider using `allow_partial_search_results` setting to bypass this error.
```

```
Validation Failed: 1: no such remote cluster: [metrics];2: no such remote cluster: [metrics];
```
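The two error classes discussed in this thread (source-data parse failures vs. shard/cluster issues) suggest the failure reason could be bucketed before choosing user-facing wording. A sketch of such a classifier (hypothetical helper; the patterns below are taken directly from the error messages quoted in this issue and would need to be extended):

```typescript
// Categories we could phrase for the user without mentioning transforms.
type SloFailureCategory =
  | "source_data" // e.g. unparseable timestamp in the source index
  | "cluster" // e.g. missing shards, unknown remote cluster
  | "unknown";

// Bucket a raw transform failure reason into a category. Pattern list is
// illustrative only, based on the two error classes seen in this thread.
function classifyFailure(reason: string): SloFailureCategory {
  if (/unable to parse date/i.test(reason)) return "source_data";
  if (/missing shards|no such remote cluster/i.test(reason)) return "cluster";
  return "unknown";
}
```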
kdelemme commented 3 weeks ago

some context from ML: https://elastic.slack.com/archives/C2AJLJHMM/p1718202853963529