elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[APM] Internal Server Error returns when requesting trace sample of Elasticsearch dependency #195882

Open ablnk opened 1 month ago

ablnk commented 1 month ago

Description: The GET /internal/apm/dependencies/charts/distribution?percentileThreshold=95&dependencyName=elasticsearch&spanName=<> request fails with status code 500 and returns "search_phase_execution_exception Caused by: circuit_breaking_exception" when requesting a trace sample of the Elasticsearch dependency.
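
For anyone trying to reproduce this outside the UI, here is a minimal sketch of how the failing request URL from the description could be assembled. The base URL and the span name are hypothetical placeholders (the span name is elided in the report), and the real request probably carries more than this, such as a time range and authentication, which the report does not show.

```python
from urllib.parse import urlencode

# Path taken from the description above.
ENDPOINT = "/internal/apm/dependencies/charts/distribution"


def build_distribution_url(
    base_url: str,
    span_name: str,
    dependency_name: str = "elasticsearch",
    percentile_threshold: int = 95,
) -> str:
    """Build the distribution chart URL using only the query parameters named in the report."""
    query = urlencode(
        {
            "percentileThreshold": percentile_threshold,
            "dependencyName": dependency_name,
            "spanName": span_name,
        }
    )
    return f"{base_url.rstrip('/')}{ENDPOINT}?{query}"


# Both arguments are assumptions for illustration: a local Kibana address
# and a made-up span name standing in for the elided one.
url = build_distribution_url("http://localhost:5601", "example-span")
print(url)
```

Passing the span name through `urlencode` means operation names containing spaces or slashes are escaped rather than breaking the query string.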

Error
search_phase_execution_exception Caused by: circuit_breaking_exception: [parent] Data too large, data for [<reduce_aggs>] would be [4130347322/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4130329600/3.8gb], new bytes reserved: [17722/17.3kb], usages [model_inference=0/0b, eql_sequence=0/0b, fielddata=0/0b, request=374915/366.1kb, inflight_requests=18532/18kb]; for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker-errors.html Root causes: task_cancelled_exception: task cancelled [Fatal failure during search: failed to merge result [[parent] Data too large, data for [<reduce_aggs>] would be [4130347322/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4130329600/3.8gb], new bytes reserved: [17722/17.3kb], usages [model_inference=0/0b, eql_sequence=0/0b, fielddata=0/0b, request=374915/366.1kb, inflight_requests=18532/18kb]; for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker-errors.html]] task_cancelled_exception: task cancelled [Fatal failure during search: failed to merge 
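
The numbers in the message above can be checked directly. The following is a small sketch (the helper name `parse_parent_breaker` is made up for illustration) that pulls the byte counts out of the parent breaker message and confirms that "would be" equals real usage plus the newly reserved bytes. It also suggests that real usage alone was already above the limit, so the 17.3kb reservation for `<reduce_aggs>` looks like the trigger rather than the cause.

```python
import re

# Excerpt of the parent breaker message quoted in the error above.
MESSAGE = (
    "[parent] Data too large, data for [<reduce_aggs>] would be "
    "[4130347322/3.8gb], which is larger than the limit of "
    "[4080218931/3.7gb], real usage: [4130329600/3.8gb], "
    "new bytes reserved: [17722/17.3kb]"
)


def parse_parent_breaker(message: str) -> dict:
    """Extract the raw byte counts from a parent circuit breaker message."""
    patterns = {
        "would_be": r"would be \[(\d+)/",
        "limit": r"limit of \[(\d+)/",
        "real_usage": r"real usage: \[(\d+)/",
        "new_bytes": r"new bytes reserved: \[(\d+)/",
    }
    values = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"field {name!r} not found in message")
        values[name] = int(match.group(1))
    return values


values = parse_parent_breaker(MESSAGE)

# Bytes by which the attempted allocation exceeds the breaker limit.
overshoot = values["would_be"] - values["limit"]

print(values["real_usage"] + values["new_bytes"] == values["would_be"])
print(values["real_usage"] > values["limit"])
print(overshoot)
```

The overshoot works out to roughly 50 MB, against a request breaker usage of only 366.1kb, which is consistent with overall heap pressure on the node rather than one oversized aggregation.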

The issue is only reproducible in a test LogsDB environment with search power set to 35.

Data in the search period:

| Data view | Docs |
|---|---|
| Logs | 2,732,575,843 |
| Metrics | 601,310,529 |
| Metrics - Kubernetes | 261,604,394 |
| APM | 1,005,652,761 |


Steps to reproduce:

  1. Go to Applications - Dependencies.
  2. Select the Elasticsearch dependency, then go to the "Operations" tab.
  3. Set the search period to Last 7 Days or larger.
  4. Select the most impactful operation.
  5. Verify that a trace sample is loaded.

Expected behavior:

Trace sample is loaded.

elasticmachine commented 1 month ago

Pinging @elastic/apm-ui (Team:APM)

elasticmachine commented 1 month ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

ablnk commented 1 week ago

Since the ES client timeout was increased, the "Request timed out" error no longer reproduces. However, the described scenario still doesn't work properly: in an environment with serverless.search.search_power_max: 35 I'm now getting a circuit breaking exception (with the search period set to Last 30 Days):

Error
search_phase_execution_exception Caused by: circuit_breaking_exception: [parent] Data too large, data for [<reduce_aggs>] would be [4130347322/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4130329600/3.8gb], new bytes reserved: [17722/17.3kb], usages [model_inference=0/0b, eql_sequence=0/0b, fielddata=0/0b, request=374915/366.1kb, inflight_requests=18532/18kb]; for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker-errors.html Root causes: task_cancelled_exception: task cancelled [Fatal failure during search: failed to merge result [[parent] Data too large, data for [<reduce_aggs>] would be [4130347322/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4130329600/3.8gb], new bytes reserved: [17722/17.3kb], usages [model_inference=0/0b, eql_sequence=0/0b, fielddata=0/0b, request=374915/366.1kb, inflight_requests=18532/18kb]; for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker-errors.html]] task_cancelled_exception: task cancelled [Fatal failure during search: failed to merge 

In an environment with serverless.search.search_power_max: 45 this does not reproduce: the request completes with status 200 OK (the requested data is returned), but Kibana doesn't render the "waterfall" component (see the attached recording).

andrewvc commented 1 week ago

@ablnk would it be reasonable to expect that, regardless of search power, we may still get timeouts with more data?

chrisdistasio commented 1 week ago

I believe we've ruled out changing SP to 35 and are defaulting to a # of 45. @cachedout can you please confirm?

ablnk commented 1 week ago

@chrisdistasio this is more for awareness of what you may encounter when using SP35; it's not a candidate for a hot fix since we're defaulting to SP45. @andrewvc I think so. I haven't tested what happens if you set a really large period like 90 days, assuming this is not a common use case.