Open jasonrhodes opened 1 month ago
Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)
I've spent some time thinking about this, and so far I can't find a single issue with SLOs that can be solved by just resetting the SLO. Most transform errors come from badly shaped source data, e.g. a timestamp that can't be parsed for some reason, which blocks the transform from going any further. The rest come from a badly formatted SLO query, which makes the transform query badly formatted too.
Here's an example of such a transform not parsing dates correctly (from overview qa):
I believe it would be good to surface the last few transform `issues.details` on the health callout.
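A rough sketch of how the callout could pick those details. It assumes the health object has the shape returned by the transform stats API (`health.issues[]` entries with `issue`, `details`, `count`, and `first_occurrence`). The `lastIssueDetails` helper and the default limit of 3 are hypothetical, not existing Kibana code:

```typescript
// Assumed shape of one entry in `health.issues` from the transform stats response.
interface TransformHealthIssue {
  type?: string;
  issue: string;
  details?: string;
  count: number;
  first_occurrence?: number; // epoch milliseconds
}

// Hypothetical helper: order issues by first occurrence (newest first), keep
// `limit` of them, and return their details, falling back to the short
// `issue` text when `details` is missing.
function lastIssueDetails(issues: TransformHealthIssue[], limit = 3): string[] {
  return [...issues]
    .sort((a, b) => (b.first_occurrence ?? 0) - (a.first_occurrence ?? 0))
    .slice(0, limit)
    .map((i) => i.details ?? i.issue);
}
```

Copying the array before sorting keeps the original API response untouched, in case other parts of the callout still read from it.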
> for now I can't find one issue with SLOs that can be solved by just resetting it
Gotcha, so we may need to pivot this work away from the reset fix and find another fix that doesn't mention transforms if at all possible, e.g. if the transform is failing because the source data is malformed, we should focus on that (SLO source data is malformed) with some guidance on how to fix the issue wherever possible. We may be able to sync up with the Logs+ effort (cc: @ruflin / @flash1293) to see if there is any way to connect into their "structure all logs" flow for guiding users through the process of "fixing" malformed documents.
Also interested in hearing other thoughts and ideas on how to surface the issue to the SLO user without mentioning the transform primitive directly.
I wonder if either the transform ES APIs or the SLO APIs could provide some kind of validation on whether an index is valid for use inside of an SLO transform? This might be related to "dataset quality" as well (cc @flash1293)
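To make the idea concrete, here is a minimal sketch of one such check, assuming we already have a small sample of source documents in hand. `checkTimestamps` and `TimestampCheck` are hypothetical names, not an existing transform or SLO API, and the check only approximates Elasticsearch date parsing by relying on JavaScript's `Date`:

```typescript
// Hypothetical result of checking sampled source documents for an SLO.
interface TimestampCheck {
  checked: number;
  invalid: number;
  examples: unknown[]; // a few offending values, for display
}

// Hypothetical helper: counts sampled documents whose timestamp field is
// missing or cannot be parsed as a date. This is a pre-flight hint only;
// Elasticsearch may accept or reject formats that JavaScript's Date does not.
function checkTimestamps(
  docs: Array<Record<string, unknown>>,
  timestampField: string,
  maxExamples = 3
): TimestampCheck {
  const result: TimestampCheck = { checked: docs.length, invalid: 0, examples: [] };
  for (const doc of docs) {
    const value = doc[timestampField];
    const parsable =
      (typeof value === 'string' || typeof value === 'number') &&
      !Number.isNaN(new Date(value).getTime());
    if (!parsable) {
      result.invalid += 1;
      if (result.examples.length < maxExamples) result.examples.push(value);
    }
  }
  return result;
}
```

A check like this could run when the SLO is created or edited, so the user hears about the source data problem before anything fails in the background.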
Thanks for the ping @jasonrhodes - this does indeed sound related to the dataset quality effort.
Some thoughts/questions:
- `logs-*-*`, and the user really needs to go out of their way for these docs to not have a valid timestamp.
- `_ignored` (a field couldn't be mapped but `ignore_malformed` was set on it) and the failure store (a document failed to index completely). Failure store shouldn't be relevant here as you won't process these documents. For the documents with `_ignored` fields, I'm not sure whether this is what causes SLO transforms to fail.
- Could you give a few concrete examples of what causes transforms to fail? Do we know the most common cases? If yes, we could see how it fits into the ongoing logs+ effort.
It seems a bit like "malformed documents" in this case means "malformed relative to the SLO configuration", which might be harder to handle by a general dataset quality page, but I would like to understand better the kind of issues we are dealing with here.
The particular error shown above comes from the source index: the transform fails when it attempts to parse the result of the search, in this case the timestamp field. But I believe this is the only error that originates in the source data. The other errors are transform-ish (probably more of a shard or cluster issue), like these:
Failed to execute phase [can_match], start; org.elasticsearch.action.search.SearchPhaseExecutionException: Search rejected due to missing shards [[.ds-metrics-apm.internal-default-2024.06.08-000030][1], [.ds-metrics-apm.service_transaction.1m-default-2024.06.07-000023][1], [.ds-metrics-apm.transaction.1m-default-2024.06.07-000024][1]]. Consider using `allow_partial_search_results` setting to bypass this error.
Validation Failed: 1: no such remote cluster: [metrics];2: no such remote cluster: [metrics];
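One way to act on this split is to classify the raw error before showing it. The sketch below is only an illustration: `describeIssue`, the categories, and the summary wording are all hypothetical, and the patterns are based only on the two messages quoted above. The date-parsing pattern is a guess, since the exact text of that error isn't quoted in this thread.

```typescript
type IssueCategory = 'source_data' | 'cluster' | 'unknown';

interface UserFacingIssue {
  category: IssueCategory;
  summary: string; // shown first; avoids the word "transform"
  rawError: string; // kept for an advanced/details view
}

// Hypothetical patterns. The first two match the messages quoted above;
// the third is a guess at what a date-parsing failure might contain.
const PATTERNS: Array<{ test: RegExp; category: IssueCategory; summary: string }> = [
  {
    test: /missing shards/i,
    category: 'cluster',
    summary: 'Some of the source data for this SLO is currently unavailable.',
  },
  {
    test: /no such remote cluster/i,
    category: 'cluster',
    summary: 'A remote cluster referenced by this SLO could not be found.',
  },
  {
    test: /failed to parse/i,
    category: 'source_data',
    summary: 'Some source documents for this SLO contain values that could not be read.',
  },
];

// Hypothetical helper: map a raw error string to a user-facing description.
function describeIssue(rawError: string): UserFacingIssue {
  const match = PATTERNS.find((p) => p.test.test(rawError));
  return {
    category: match?.category ?? 'unknown',
    summary: match?.summary ?? 'This SLO is not updating as expected.',
    rawError,
  };
}
```

The raw error stays on the returned object so an advanced view can still show it to users who manage their own stack.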
some context from ML: https://elastic.slack.com/archives/C2AJLJHMM/p1718202853963529
Following up with the work done in https://github.com/elastic/kibana/pull/181351 to provide health information for SLOs that are in an unhealthy state, we should continue to evaluate how to improve this scenario for users.
Goals
Primarily, a user's first view of the "unhealthy" indicator shouldn't mention the underlying "transform" primitive at all. Users who don't know anything about the Elastic stack won't know what a transform is, so a message saying, "The following transform is in an unhealthy state" will be useless for that type of user.
Second, we don't want to completely block access to this transform information for a user who does manage their own stack, so we should try to provide a way to click through for more advanced error information where we can provide the transform ID, and possibly a link to the UI where they may be able to manage this transform, etc.
To this end, we are considering the following tasks: