Open ablnk opened 1 month ago
Pinging @elastic/apm-ui (Team:APM)
Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)
Since the ES client timeout has been increased, the "Request timed out" error no longer reproduces. However, the described scenario still doesn't work properly in an environment with serverless.search.search_power_max: 35
I'm now getting a circuit breaking exception (with the search period set to Last 30 Days):
Error
search_phase_execution_exception Caused by: circuit_breaking_exception: [parent] Data too large, data for [<reduce_aggs>] would be [4130347322/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4130329600/3.8gb], new bytes reserved: [17722/17.3kb], usages [model_inference=0/0b, eql_sequence=0/0b, fielddata=0/0b, request=374915/366.1kb, inflight_requests=18532/18kb]; for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker-errors.html Root causes: task_cancelled_exception: task cancelled [Fatal failure during search: failed to merge result [[parent] Data too large, data for [<reduce_aggs>] would be [4130347322/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4130329600/3.8gb], new bytes reserved: [17722/17.3kb], usages [model_inference=0/0b, eql_sequence=0/0b, fielddata=0/0b, request=374915/366.1kb, inflight_requests=18532/18kb]; for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker-errors.html]] task_cancelled_exception: task cancelled [Fatal failure during search: failed to merge
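For anyone reading the numbers in that message, here is a minimal sketch (not part of the original report) that pulls the byte counts out of the quoted error and checks how they relate. The helper name `parse_breaker` and the regular expressions are my own assumptions for illustration; only the figures are copied from the error above.

```python
import re

# Excerpt of the breaker message quoted above (numbers copied verbatim).
MESSAGE = (
    "[parent] Data too large, data for [<reduce_aggs>] would be "
    "[4130347322/3.8gb], which is larger than the limit of "
    "[4080218931/3.7gb], real usage: [4130329600/3.8gb], "
    "new bytes reserved: [17722/17.3kb]"
)


def parse_breaker(message: str) -> dict:
    """Extract the raw byte counts from a parent-breaker message.

    Hypothetical helper: the patterns only match the wording seen in
    this particular error text.
    """
    patterns = {
        "would_be": r"would be \[(\d+)/",
        "limit": r"limit of \[(\d+)/",
        "real_usage": r"real usage: \[(\d+)/",
        "new_bytes": r"new bytes reserved: \[(\d+)/",
    }
    return {
        name: int(re.search(pattern, message).group(1))
        for name, pattern in patterns.items()
    }


stats = parse_breaker(MESSAGE)

# The "would be" figure is the real usage plus the new reservation.
adds_up = stats["real_usage"] + stats["new_bytes"] == stats["would_be"]

# How far over the limit the request would have pushed the breaker.
overshoot = stats["would_be"] - stats["limit"]

# Whether usage was already above the limit before the new reservation.
already_over = stats["real_usage"] > stats["limit"]

print(adds_up, overshoot, already_over)
```

As I read it, the reported real usage already exceeds the limit before the 17.3kb reservation for `<reduce_aggs>` is added, so the aggregation reduce step looks like the point where the breaker tripped rather than the sole source of the memory pressure.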
In an environment with serverless.search.search_power_max: 45
this does not reproduce. The request completes with status 200 OK (the requested data is returned), but Kibana doesn't render the "waterfall" component. See the recording:
@ablnk would it be reasonable to expect that, regardless of search power, we may still get timeouts with more data?
I believe we've ruled out changing SP to 35 and are defaulting to a value of 45. @cachedout can you please confirm?
@chrisdistasio this is more for awareness of what you may encounter when using SP35. It is not a candidate for a hotfix since we're defaulting to SP45. @andrewvc I think so. I haven't tested what happens if you set a really large period such as 90 days, assuming this is not a common use case.
Description: GET
/internal/apm/dependencies/charts/distribution?percentileThreshold=95&dependencyName=elasticsearch&spanName=<>
request fails with status code 500 and returns "search_phase_execution_exception Caused by: circuit_breaking_exception" when requesting a trace sample of the Elasticsearch dependency. The issue is only reproducible in a test LogsDB environment with search power set to 35.
Data in the search period:
Logs
Steps to reproduce:
Expected behavior:
Trace sample is loaded.