Open ablnk opened 8 months ago
Pinging @elastic/apm-ui (Team:APM)
Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)
I wonder if this is unique to serverless or if the same challenge exists in ESS? Can we determine this?
As a Tech Preview feature, we can prioritize this investigation lower than GA features like Service Map
@chrisdistasio for context, the feature is built on a scripted_metric aggregation, which we more or less expect to break down in some cases. The issues are similar to those of the service map, in the sense that we need to look at trace events and cannot use aggregated metrics, so its performance characteristics become unpredictable. We can build in some safeguards to get it to GA though. Happy to help out if needed.
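For illustration only, here is a minimal sketch of one kind of request-level safeguard, assuming the query is sent as a plain search body. It uses the Elasticsearch search parameters timeout and terminate_after to bound the work done per shard. The SearchBody and GuardrailOptions types, the withGuardrails helper, and the placeholder scripts are all hypothetical; this is not the actual critical path query or any guardrail that exists in Kibana.

```typescript
// Hypothetical request shape for illustration only; not Kibana's or the
// Elasticsearch client's actual types.
interface SearchBody {
  size?: number;
  timeout?: string;
  terminate_after?: number;
  query?: Record<string, unknown>;
  aggs?: Record<string, unknown>;
}

// Hypothetical options: both values are assumptions to be tuned.
interface GuardrailOptions {
  timeout: string; // e.g. "30s", passed through as the search `timeout`
  maxDocsPerShard: number; // passed through as `terminate_after`
}

// Returns a copy of the body with the two limits applied; the input is not mutated.
function withGuardrails(body: SearchBody, opts: GuardrailOptions): SearchBody {
  const n = opts.maxDocsPerShard;
  if (!(n > 0) || Math.floor(n) !== n) {
    throw new RangeError("maxDocsPerShard must be a positive integer");
  }
  return { ...body, timeout: opts.timeout, terminate_after: n };
}

// Placeholder scripted_metric aggregation that only counts documents.
// The real critical path scripts are far more involved.
const unguarded: SearchBody = {
  size: 0,
  query: { range: { "@timestamp": { gte: "now-15m" } } },
  aggs: {
    placeholder_metric: {
      scripted_metric: {
        init_script: "state.count = 0",
        map_script: "state.count += 1",
        combine_script: "return state.count",
        reduce_script:
          "long total = 0; for (c in states) { total += c } return total",
      },
    },
  },
};

const guarded = withGuardrails(unguarded, {
  timeout: "30s",
  maxDocsPerShard: 100000,
});
```

The trade-off is that a request cut short by either limit can return partial results, so the UI would need to tell the user that the data is incomplete.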
@chrisdistasio the issue is not unique to serverless; I just reproduced it on a stateful deployment too.
Hey @dgieselaar, what guardrails do you have in mind for this?
This might be related to #181790
Reiterating the points made above, Traces Explorer is in Tech Preview and this affects both serverless and stateful - as such, this is a lower priority.
@paulb-elastic FWIW, the concerns around service maps were not "this breaks for our users", but "this can take down a cluster and page the ES team and it isn't their responsibility". I think the same applies here, but the risk is lower due to the fact that it is not enabled by default. I assume the ES team still wants a fix for this as well though.
@chrisdistasio I think @dgieselaar's comment above answers your earlier question about whether there is currently a mechanism in Elasticsearch to guard itself against being taken down when handling our requests, and that we need to build for that on our end.
I can't really think of an alternative other than rewriting the query not to use the scripted_metric aggregation. @dgieselaar @neptunian do you think the investigation conducted here https://github.com/elastic/kibana/issues/179229#issuecomment-2163872017 would help in this scenario too?
Perhaps the fact that it is still in technical preview also gives us more flexibility to rewrite this feature to not use scripted_metric aggs, provided that there will be performance gains in doing so.
I'm constantly experiencing another issue where, for the same time range, sometimes the data is returned by the server, while other times it keeps loading forever or returns empty.
As per the discussion with @chrisdistasio, this won't be tackled right now, but moved to the backlog
Version: Serverless project v8.14.0, stateful deployment v8.14.0-SNAPSHOT
Description:
The POST /internal/apm/traces/aggregated_critical_path request returns Internal Server Error.
Preconditions: I reproduced the issue with ~780k documents in the APM data view within a 15-minute interval, from 761 services.
Steps to reproduce:
Expected behavior: Data presentation should be rendered.
Screenshots:
https://github.com/elastic/kibana/assets/34958359/0f201a48-e600-492f-8ce5-31ec25f19859
Response: