elastic / kibana


[APM] Top dependencies request sometimes fails when searching outside of the boost window #178979

Open ablnk opened 8 months ago

ablnk commented 8 months ago

Version: Serverless project v 8.14.0

Description: The GET internal/apm/dependencies/top_dependencies request fails with status code 502 and returns "backend closed connection" when searching for top dependencies outside of the boost window.

Preconditions: I reproduced the issue with 102 dependencies and 761 services.

Steps to reproduce:

  1. Go to Applications - Dependencies.
  2. Filter data by last 30 days.

Expected behavior: Dependencies available within the last 30 days are returned.

elasticmachine commented 8 months ago

Pinging @elastic/apm-ui (Team:APM)

elasticmachine commented 8 months ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

chrisdistasio commented 7 months ago

@smith can we have someone assigned to this to do some additional investigation to determine whether this is directly related to an issue with searching outside the boost window?

Is it possible to quantify the number of services and dependencies that appears to be the threshold for causing the issue?

Trying to get a better understanding of the severity as it relates to the boost window.

In QA the dependencies themselves come back pretty quickly; the sparklines are slower to load, but they eventually do. I acknowledge the number of services and dependencies is far fewer than what was tested--again, trying to determine where the threshold is.

crespocarlos commented 7 months ago

I've managed to reproduce the same problem in QA with ~100 dependencies.

Image

The changes planned as part of #178491 might solve this problem. I suspect that the histogram aggregation is slowing down the query.

@neptunian, the second option described in your comment could be a more robust solution to prevent this from happening. As part of #178491, it could be worth checking whether the problem described in this ticket is solved as well.

dgieselaar commented 7 months ago

@crespocarlos with regards to the bucketing problem, it could be worth trying out ES|QL here - bucketing is much more relaxed there. Although it's probably easier to separate the date histogram buckets from the single search request.

However, if the bucket limit is the issue, you'd get an error describing it as such. It won't take down an Elasticsearch node, at least not in ES. This might be a different issue. Do we know if APM data is enabled for this cluster and where it goes/how we can identify it (e.g. by some label)?

neptunian commented 7 months ago

> Do we know if APM data is enabled for this cluster and where it goes/how we can identify it (e.g. by some label)?

I think this is necessary for determining what's causing the error. I'm not sure fewer buckets will solve it, as querying large time ranges outside this "boost" window might still take too long, if that's the problem. I've asked in the Slack channel about getting APM data for the cluster.

dgieselaar commented 7 months ago

@neptunian found it, the issue is twofold:

I have spoken to @crespocarlos about this. I would recommend doing a simple request to get the total number of hits and then, based on that, calculating a sample rate that returns statistically significant results, and using the random_sampler agg if that sample rate is < 0.5. You will potentially lose the long tail of results, but the alternative is a request that times out.
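A minimal sketch of that approach using the Elasticsearch JS client (the index pattern and filters are borrowed from the queries later in this thread; the node URL, function name, and target sample size are assumptions for illustration, not the actual Kibana implementation):

```typescript
import { Client } from '@elastic/elasticsearch';

// Hypothetical client; in Kibana this would come from the request context.
const client = new Client({ node: 'http://localhost:9200' });

// Sketch of the suggestion above: count the matching docs first, derive a
// sampling probability from that count, and only wrap the aggregations in
// random_sampler when the derived rate is < 0.5.
async function getTopDependencyStats(start: string, end: string) {
  const query = {
    bool: {
      filter: [
        { term: { 'metricset.name': 'service_destination' } },
        { range: { '@timestamp': { gte: start, lte: end } } },
      ],
    },
  };

  // Step 1: a cheap request for the total number of hits.
  const { count } = await client.count({ index: 'metrics-apm*', query });

  // Step 2: pick a probability that keeps roughly `targetSampleSize` docs
  // (targetSampleSize is an arbitrary illustrative value).
  const targetSampleSize = 1_000_000;
  const probability = Math.min(1, targetSampleSize / Math.max(count, 1));

  const connectionAggs = {
    connections: {
      composite: {
        size: 1500,
        sources: [
          { serviceName: { terms: { field: 'service.name' } } },
          { dependencyName: { terms: { field: 'span.destination.service.resource' } } },
        ],
      },
    },
  };

  // Step 3: only sample when the rate is below 0.5, as suggested above,
  // accepting that the long tail of connections may be lost.
  const aggs =
    probability < 0.5
      ? { sampled: { random_sampler: { probability }, aggs: connectionAggs } }
      : connectionAggs;

  return client.search({ index: 'metrics-apm*', size: 0, query, aggs });
}
```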

crespocarlos commented 7 months ago

@dgieselaar ~how many hits should we consider as the cutoff for using the random_sampler agg?~ Nvm, I understand now what you meant.

crespocarlos commented 7 months ago

There's an example of what Dario described above in get_log_categories. We can follow the same idea.

crespocarlos commented 6 months ago

@chrisdistasio, following up on @paulb-elastic's comment: I was wondering if you have something in mind to help users understand potential data loss due to the random_sampler aggregation usage.

I just want to highlight that the changes in https://github.com/elastic/kibana/pull/182828 might affect the results (depending on the amount of data + date range).

chrisdistasio commented 6 months ago

We have an analog for this somewhere in Services (IIRC). I'm trying to locate it in the UI. I would like to use consistent language if we can.

crespocarlos commented 5 months ago

I've tested the fix in QA with a 30-day range:

Image

crespocarlos commented 5 months ago

I'm reopening this because I'm seeing intermittent circuit breaker errors. Perhaps the random sampler probability needs to be adjusted.

crespocarlos commented 5 months ago

It seems like the errors are caused by a transform: https://elastic.slack.com/archives/C05UT5PP1EF/p1718023213609569

ablnk commented 2 months ago

I'm reopening the issue because it is still reproducible, even within the boost window.

dgieselaar commented 2 months ago

@crespocarlos I think we should just use ES|QL; it's way faster here. Keep me honest, but I think they're equivalent:

ES|QL request (2.5s):

```json
POST _query
{
  "query": """
    FROM metrics-apm*
    | STATS
        MAX(agent.name),
        MAX(span.type),
        MAX(span.subtype),
        failure_rate = COUNT(CASE(event.outcome == "failure", 1, NULL)) / COUNT(CASE(event.outcome == "success" OR event.outcome == "failure", 1, NULL))
        BY timestamp = BUCKET(@timestamp, 43200 seconds), service.name, span.destination.service.resource
    | STATS VALUES(timestamp) BY service.name, span.destination.service.resource
    | LIMIT 10000
  """,
  "filter": {
    "bool": {
      "filter": [
        { "terms": { "processor.event": ["metric"] } },
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "filter": [{ "term": { "metricset.name": "service_destination" } }],
                  "must_not": [{ "terms": { "metricset.interval": ["10m", "60m"] } }]
                }
              }
            ]
          }
        },
        { "bool": { "must_not": [{ "terms": { "_tier": [] } }] } }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "filter": [{ "term": { "metricset.name": "service_destination" } }],
                  "must_not": [{ "terms": { "metricset.interval": ["10m", "60m"] } }]
                }
              },
              { "range": { "@timestamp": { "gte": "now-7d", "lte": "now", "format": "epoch_millis" } } },
              { "bool": { "must_not": [{ "terms": { "agent.name": ["js-base", "rum-js", "opentelemetry/webjs", "otlp/webjs"] } }] } }
            ]
          }
        }
      ]
    }
  }
}
```

_search request (11s):

```json
POST metrics-apm*/_search?request_cache=false
{
  "track_total_hits": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "terms": { "processor.event": ["metric"] } },
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "filter": [{ "term": { "metricset.name": "service_destination" } }],
                  "must_not": [{ "terms": { "metricset.interval": ["10m", "60m"] } }]
                }
              }
            ]
          }
        },
        { "bool": { "must_not": [{ "terms": { "_tier": [] } }] } }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "filter": [{ "term": { "metricset.name": "service_destination" } }],
                  "must_not": [{ "terms": { "metricset.interval": ["10m", "60m"] } }]
                }
              },
              { "range": { "@timestamp": { "gte": "now-7d", "lte": "now", "format": "epoch_millis" } } },
              { "bool": { "must_not": [{ "terms": { "agent.name": ["js-base", "rum-js", "opentelemetry/webjs", "otlp/webjs"] } }] } }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "connections": {
      "composite": {
        "size": 1500,
        "sources": [
          { "serviceName": { "terms": { "field": "service.name" } } },
          { "dependencyName": { "terms": { "field": "span.destination.service.resource" } } }
        ]
      },
      "aggs": {
        "sample": {
          "top_metrics": {
            "size": 1,
            "metrics": [
              { "field": "service.environment" },
              { "field": "agent.name" },
              { "field": "span.type" },
              { "field": "span.subtype" }
            ],
            "sort": { "@timestamp": "desc" }
          }
        },
        "total_latency_sum": { "sum": { "field": "span.destination.service.response_time.sum.us" } },
        "total_latency_count": { "sum": { "field": "span.destination.service.response_time.count" } },
        "timeseries": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "43200s",
            "extended_bounds": { "min": "now-7d", "max": "now" }
          },
          "aggs": {
            "latency_sum": { "sum": { "field": "span.destination.service.response_time.sum.us" } },
            "count": { "sum": { "field": "span.destination.service.response_time.count" } },
            "event.outcome": {
              "terms": { "field": "event.outcome" },
              "aggs": {
                "count": { "sum": { "field": "span.destination.service.response_time.count" } }
              }
            }
          }
        }
      }
    }
  }
}
```

crespocarlos commented 2 months ago

> I'm reopening the issue because it is still reproducible, even within the boost window.

@ablnk which env did you use to reproduce the problem?

dgieselaar commented 2 months ago

I forgot the statistics 🤦 I updated the query with the failure rate. I cannot do the latency stats because of a type mismatch, but I've added the failure rate stats (the type mismatch should be fixed once ES|QL supports union types).

Edit: it works with type casting:

```esql
FROM metrics-apm*
    | STATS
        agent.name = MAX(agent.name),
        span.type = MAX(span.type),
        span.subtype = MAX(span.subtype),
        avg_latency = SUM(span.destination.service.response_time.sum.us::long) / SUM(span.destination.service.response_time.count::long),
        failure_rate = COUNT(CASE(event.outcome == "failure", 1, NULL)) / COUNT(CASE(event.outcome == "success" OR event.outcome == "failure", 1, NULL))
        BY timestamp = BUCKET(@timestamp, 43200 seconds), service.name, span.destination.service.resource
    | STATS
        timeseries = VALUES(timestamp),
        span.subtype = MAX(span.subtype),
        span.type = MAX(span.type),
        agent.name = MAX(agent.name)
        BY service.name, span.destination.service.resource
    | LIMIT 10000
```

ablnk commented 2 months ago

@crespocarlos keep-serverless-qa