Open ablnk opened 8 months ago
Pinging @elastic/apm-ui (Team:APM)
Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)
@smith can we have someone assigned to this to do some additional investigation and determine whether this is directly related to an issue with searching outside the boost window?
Is it possible to quantify the number of services and dependencies that appears to be the threshold for causing the issue?
I'm trying to get a better understanding of the severity as it relates to the boost window.
In QA the dependencies themselves come back pretty quickly; the sparklines are slower to load, but they eventually do. I acknowledge the number of services and dependencies is far fewer than what was tested; again, I'm trying to determine where the threshold is.
I've managed to reproduce the same problem in QA with ~100 dependencies
The changes that will be done as part of #178491 might solve this problem. I suspect that the histogram aggregation is slowing down the query.
@neptunian, the second option described in your comment could be a more robust solution to prevent this from happening. As part of #178491 it could be worth checking whether the problem described in this ticket gets solved as well.
@crespocarlos with regard to the bucketing problem, it could be worth trying out ES|QL here - bucketing is much more relaxed there. Although it's probably easier to separate the date histogram buckets from the single search request.
However, if the bucket limit is the issue, you'd get an error describing it as such. It won't take down an Elasticsearch node, at least not in ES. This might be a different issue. Do we know if APM data is enabled for this cluster and where it goes/how we can identify it (e.g. by some label)?
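To make the "separate the date histogram buckets" idea above a bit more concrete, here's a rough TypeScript sketch of splitting the sparkline buckets into a second request (the index pattern, sizes, and interval are placeholders, not the actual top_dependencies implementation):

```ts
import type { Client, estypes } from '@elastic/elasticsearch';

// Phase 1 fetches the top dependencies without any date_histogram sub-aggregation;
// phase 2 only builds sparkline buckets for the dependencies we will actually render.
async function getTopDependenciesInTwoRequests(esClient: Client, start: string, end: string) {
  const rangeFilter = { range: { '@timestamp': { gte: start, lte: end } } };

  const topDeps = await esClient.search({
    index: 'metrics-apm*',
    size: 0,
    query: rangeFilter,
    aggs: {
      dependencies: {
        terms: { field: 'span.destination.service.resource', size: 100 },
      },
    },
  });

  const dependenciesAgg = topDeps.aggregations?.dependencies as
    | estypes.AggregationsStringTermsAggregate
    | undefined;
  const buckets = (dependenciesAgg?.buckets ?? []) as estypes.AggregationsStringTermsBucket[];
  const resources = buckets.map((bucket) => String(bucket.key));

  if (resources.length === 0) {
    return topDeps;
  }

  // Second, narrower request: sparkline histograms only for the selected dependencies.
  return esClient.search({
    index: 'metrics-apm*',
    size: 0,
    query: {
      bool: {
        filter: [rangeFilter, { terms: { 'span.destination.service.resource': resources } }],
      },
    },
    aggs: {
      dependencies: {
        terms: { field: 'span.destination.service.resource', size: resources.length },
        aggs: {
          timeseries: { date_histogram: { field: '@timestamp', fixed_interval: '12h' } },
        },
      },
    },
  });
}
```

This way the first request stays cheap regardless of the time range, and the histogram work is bounded by the number of dependencies that actually get rendered.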
> Do we know if APM data is enabled for this cluster and where it goes/how we can identify it (e.g. by some label)?
I think this is necessary for determining what's causing the error. I'm not sure fewer buckets will solve it, as querying large time ranges outside this "boost" window might still take too long, if that's the problem. I've asked in the Slack channel about having APM data for the cluster.
@neptunian found it; the issue is twofold:
I have spoken to @crespocarlos about this. I would recommend doing a simple request to get the total number of hits and then, based on that, calculating a sample rate that returns statistically significant results, using the random_sampler agg if that sample rate is < 0.5. You will potentially lose the long tail of results, but the alternative is a request that times out.
@dgieselaar ~how many hits could we consider as a cutoff to use the random_sampler agg?~ Nvm, I understand now what you meant.
There's an example of what Dario described above in get_log_categories. We can follow the same idea.
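For illustration, a rough sketch of that flow (the helper names and the 100k target sample size are assumptions on my part, not what get_log_categories actually uses):

```ts
import type { Client, estypes } from '@elastic/elasticsearch';

// Assumed number of sampled docs that we treat as "statistically significant".
const TARGET_SAMPLE_SIZE = 100_000;

async function getSamplingProbability(
  esClient: Client,
  index: string,
  query: estypes.QueryDslQueryContainer
): Promise<number> {
  // Cheap first request: we only need the total number of matching docs.
  const { count } = await esClient.count({ index, query });
  return count > 0 ? Math.min(1, TARGET_SAMPLE_SIZE / count) : 1;
}

async function searchWithOptionalSampling(
  esClient: Client,
  index: string,
  query: estypes.QueryDslQueryContainer,
  aggs: Record<string, estypes.AggregationsAggregationContainer>
) {
  const probability = await getSamplingProbability(esClient, index, query);
  // random_sampler only accepts probabilities <= 0.5 (or exactly 1), so fall back to the
  // unsampled aggregations when sampling wouldn't meaningfully reduce the work anyway.
  return esClient.search({
    index,
    size: 0,
    query,
    aggs:
      probability < 0.5
        ? { sampled: { random_sampler: { probability }, aggs } }
        : aggs,
  });
}
```

When sampling kicks in, the sub-aggregation results come back nested under `sampled`, so the response parsing needs to account for both shapes.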
@chrisdistasio, following up on @paulb-elastic's comment: I was wondering if you have something in mind to help users understand the potential data loss that comes from using the random_sampler aggregation.
I just want to highlight that the changes in https://github.com/elastic/kibana/pull/182828 might affect the results (depending on the amount of data and the date range).
We have an analog for this somewhere in services (IIRC). I'm trying to locate it in the UI. I would like to use consistent language if we can.
I've tested the fix in QA with a 30-day range:
I'm reopening this because I'm seeing intermittent circuit breaker errors. Perhaps the random sampler probability needs to be adjusted.
It seems like the errors are caused by a transform https://elastic.slack.com/archives/C05UT5PP1EF/p1718023213609569
I'm reopening the issue because it is still reproducible, even within the boost window.
@crespocarlos I think we should just use ES|QL; it's way faster here. Keep me honest, but I think they're equivalent:
> I'm reopening the issue because it is still reproducible, even within the boost window.
@ablnk which env did you use to reproduce the problem?
I forgot the statistics 🤦 I updated the query with the failure rate. I can't do the latency stats because of a type mismatch, but I've added the failure rate stats (the type mismatch should be fixed as soon as ES|QL supports union types).
edit: the latency stats work by type casting:
FROM metrics-apm*
| STATS
agent.name = MAX(agent.name),
span.type = MAX(span.type),
span.subtype = MAX(span.subtype),
avg_latency = SUM(span.destination.service.response_time.sum.us::long) / SUM(span.destination.service.response_time.count::long),
failure_rate = COUNT(CASE(event.outcome == "failure", 1, NULL)) / COUNT(CASE(event.outcome == "success" OR event.outcome == "failure", 1, NULL))
BY timestamp = BUCKET(@timestamp, 43200 seconds), service.name, span.destination.service.resource
| STATS
timeseries = VALUES(timestamp),
span.subtype = MAX(span.subtype),
span.type = MAX(span.type),
agent.name = MAX(agent.name)
BY service.name, span.destination.service.resource
| LIMIT 10000
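For anyone who wants to reproduce the comparison programmatically, a minimal sketch for running the query above against the ES|QL `_query` endpoint (connection details are placeholders, and error handling is omitted):

```ts
import { Client } from '@elastic/elasticsearch';

// Placeholder connection details; point this at the cluster under test.
const esClient = new Client({ node: 'http://localhost:9200' });

interface EsqlResponse {
  columns: Array<{ name: string; type: string }>;
  values: unknown[][];
}

// Runs an ES|QL query via POST /_query and prints the column names plus a few rows.
async function runEsqlQuery(query: string): Promise<EsqlResponse> {
  const result = await esClient.transport.request<EsqlResponse>({
    method: 'POST',
    path: '/_query',
    body: { query },
  });
  console.log(result.columns.map((column) => column.name));
  console.table(result.values.slice(0, 5));
  return result;
}
```

Passing the full query above as a template literal should return one row per service.name / span.destination.service.resource pair, which makes it easy to diff against the aggregation-based response.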
@crespocarlos keep-serverless-qa
Version: Serverless project v 8.14.0
Description:
The `GET internal/apm/dependencies/top_dependencies` request fails with status code 502 and returns `backend closed connection` when searching for top dependencies outside of the boost window.
Preconditions: I reproduced the issue with 102 dependencies and 761 services.
Steps to reproduce:
Expected behavior: Dependencies available within the last 30 days are returned.