elastic / kibana


[APM] Service map request fails when searching outside of the boost window #179229

Closed ablnk closed 3 weeks ago

ablnk commented 4 months ago

Version: Serverless project v 8.14.0

Description: The GET internal/apm/service-map request fails with status code 502 and returns "backend closed connection" when filtering data outside of the boost window.

Preconditions: I reproduced the issue with ~154 dependencies and 1011 services.

Steps to reproduce:

  1. Go to Applications - Services.
  2. Navigate to Service Map tab and filter data outside of the boost window.

Expected behavior: Service map data is returned for the set time interval.

Logs:

Error decoding JSON: json: cannot unmarshal string into Go value of type map[string]interface {}
Error decoding JSON: unexpected EOF

Log messages look somewhat fragmentary:

image

elasticmachine commented 4 months ago

Pinging @elastic/apm-ui (Team:APM)

elasticmachine commented 4 months ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

chrisdistasio commented 3 months ago

does this always fail outside the boost window? for example, does it consistently fail one day after the boost window or does it fail 12 days after the boost window? does the amount of data/complexity of the service map matter?

ablnk commented 3 months ago

@chrisdistasio With the data I specified in the preconditions, it was always failing when searching outside the boost window. The amount of data matters, especially the number of dependencies.

chrisdistasio commented 3 months ago

@smith can we have someone assigned to this to do some additional investigation to determine whether this is directly related to some issue searching outside boost window?

is it possible to quantify the number of services and dependencies that appear to be a threshold for causing the issue?

trying to get a better understanding of the severity as it relates to boost window.

for example, i have trouble getting a response in qa for 15 mins - https://keepserverless-qa-oblt-ba6694.kb.eu-west-1.aws.qa.elastic.cloud/app/apm/service-map?comparisonEnabled=false&environment=ENVIRONMENT_ALL&kuery=&offset=1d&rangeFrom=now-15m&rangeTo=now&serviceGroup=

smith commented 3 months ago

@chrisdistasio this is prioritized in our refining queue on our internal backlog, so the first steps are determining under which conditions this happens.

crespocarlos commented 3 months ago

@ablnk Is it possible to access the environment where the error is happening? I'd like to have a closer look at the logs.

crespocarlos commented 2 months ago

I'm getting the 502 in a QA deployment with ~1800 services and ~170 dependencies, within a 15m range, which I believe is still inside the 7d boost window.

Image

The map is somewhat complex (5m range)

Image

I managed to crash the server when I used a 1d range.

Looks like it's related to the amount of data on serverless. Searches outside the boost window could increase the likelihood of this error happening.

crespocarlos commented 2 months ago

The solution for this might be related to https://github.com/elastic/observability-serverless/issues/18 and #181790

neptunian commented 2 months ago

@chrisdistasio @paulb-elastic Are we aiming for the short-term or the long-term solution here? The short term would be keeping the scripted metric aggregation but making it faster. The tradeoff is the service map's accuracy or completeness, as we could lose some connections and discovered services in the time range, though we already do some sampling of the trace ids we pass in, so this exists to an extent already, I believe.

The long term is removing the scripted aggregation and moving this processing somewhere else, not sure where.

paulb-elastic commented 2 months ago

The steer I've been given is to focus on the shorter term solution for now (although happy to hear @chrisdistasio's opinion too).

neptunian commented 2 months ago

Thanks @paulb-elastic. I've been having some conversations in another issue, and will get started on optimizing the scripted metric agg and running some tests. Then I'll figure out what this limit number should be, which is the more challenging bit, as Dario mentioned.

neptunian commented 2 months ago

I had a chat with Dario the week before last and we aligned on a strategy to avoid OOMs.

optimization strategy

tradeoffs

issues to resolve / dependencies

paulb-elastic commented 2 months ago

Thanks for the update @neptunian

what should the default number be for the new setting that limits events to be processed? will need support from ES

Are you ok proceeding with this, and not blocked?

neptunian commented 1 month ago

@paulb-elastic Yes, not blocked. I'll be focusing on it this week.

neptunian commented 1 month ago

I had a chat with @nik9000 about the strategy above. We came to the conclusion that limiting in the reduce phase likely won't resolve the error. Once a large number of documents have arrived at the reduce phase there are already memory issues, whether or not we limit adding them to the eventsById hash and processing them. With the current allowance of 100,000 documents per shard and an unknown number of shards, the total volume of data being returned can be extremely large. Unfortunately there is no easy way to know how many shards a cluster has. I can use the search shards API and count them, but that could be a very large response. I believe this error I get on serverless is due to the amount of docs:

Screenshot 2024-06-05 at 11 00 55 AM
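For reference, a minimal sketch of the search shards API call mentioned above (the `traces-*` pattern is just an example):

```
# Returns routing info for every shard a search on this pattern would hit.
# The length of the top-level "shards" array gives the shard count; the node
# and index metadata that comes with it is what makes the response large.
GET /traces-*/_search_shards
```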

Instead, I think we should tweak the limits with the levers we already have in place, such as decreasing terminate_after from 100k per shard and allowing fewer max traces (currently 1,000).

One thing I've found out after talking to @crespocarlos about his PR https://github.com/elastic/kibana/pull/183802 is that the synthtrace data that exists in the serverless environment where we are testing is not realistic data. One reason why things are so slow and we might be getting OOM errors is that we are actually running the synthtrace scenario designed to trigger OOMs:

Here is a response I get on the edge cluster; some of the docs are from the OOM scenario:

Screenshot 2024-06-05 at 2 23 40 PM

This scenario creates very long traces (not common or realistic) in order to trigger an OOM. Sometimes, when I can get it to load, I end up with a never-ending trace. The service map is shown below; you cannot see the services because I've zoomed out to show how long this trace can be.

Screenshot 2024-06-05 at 2 19 51 PM

Given that there is a recursive function in the reduce phase, we can end up with an OOM there, or a stack overflow due to the depth of the trace.

I think we should verify that the data we are using to test with is realistic and not broken before pulling any levers. It's also very difficult to test otherwise.

dgieselaar commented 1 month ago

@neptunian fwiw, you can count the number of documents by executing a request that only counts the documents, but does not execute the aggregation. For the second request, the filter cache will kick in and it will run the aggregation on the collected documents. Unfortunately that is a hard stop, and you cannot partially process traces with that strategy. You could chunk up the documents differently until you get a number of documents that make sense, but it's probably more sensible to nudge the user to tweak these settings if possible (?).
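As a rough sketch of that count-first approach (the filter below is a placeholder, not the actual service-map query; only the `trace.id` and `@timestamp` fields are standard APM/ECS fields):

```
# Step 1: count the matching documents only; no aggregation is executed.
POST /traces-*/_count
{
  "query": {
    "bool": {
      "filter": [
        { "terms": { "trace.id": [ "example-trace-id" ] } },
        { "range": { "@timestamp": { "gte": "now-15m", "lte": "now" } } }
      ]
    }
  }
}

# Step 2 (only if the count is acceptable): run the same filter again with the
# scripted metric aggregation; the filter cache warmed by step 1 helps here.
```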

Re:

This scenario creates very long traces (not common or realistic) in order to trigger an OOM.

this is the whole point no? It's not about realistic data, it's about making sure that no data at all can trigger an OOM.

neptunian commented 1 month ago

but it's probably more sensible to nudge the user to tweak these settings if possible

@dgieselaar Do you mean the settings for serviceMapTerminateAfter, serviceMapMaxTracesPerRequest, serviceMapMaxTraces?

this is the whole point no? It's not about realistic data, it's about making sure that no data at all can trigger an OOM.

I guess I'm assuming that if we were trying to make sure no data could trigger an OOM we wouldn't have recursive functions in the service map and waterfall that don't have any safeguards to account for a trace being that long. Unless that is just an oversight. When I run the service_map_oom scenario I typically get a runtime error in the Kibana logs caused by stack_overflow_error. I thought these scenarios were more geared towards integration testing edge cases. I'm just not sure we should be optimizing for all of the cases. Even if we add a safeguard for these very long traces it can make the queries much slower than they would otherwise be. I do think it's useful to have it all in there, though, but I think we and those testing on these clusters for performance need to be aware that there's an expectation things could be slower.

dgieselaar commented 1 month ago

@dgieselaar Do you mean the settings for serviceMapTerminateAfter, serviceMapMaxTracesPerRequest, serviceMapMaxTraces?

yes (or any other settings we add). I think the fewer settings we have the better, but not sure what is practical.

I guess I'm assuming that if we were trying to make sure no data could trigger an OOM we wouldn't have recursive functions in the service map and waterfall that don't have any safeguards to account for a trace being that long. Unless that is just an oversight.

When you say "service map" and "waterfall" are you talking about browser code? If so, it is something we have not optimized for. This data (spans or descendants that refer to each other as parents) is corrupted, and we can't guard against every imaginable form of data corruption. However, an OOM for an Elasticsearch node has much more devastating consequences (like paging Nik at night).

When I run the service_map_oom scenario I typically get a runtime error in the Kibana logs caused by stack_overflow_error.

Yes, stack_overflow_error is not what we are trying to prevent here. I think that is just a guardrail for the scripted metric agg/Painless doing its work.

I thought these scenarios were more geared towards integration testing edge cases. I'm just not sure we should be optimizing for all of the cases. Even if we add a safeguard for these very long traces it can make the queries much slower than they would otherwise be. I do think it's useful to have it all in there, though, but I think we and those testing on these clusters for performance need to be aware that there's an expectation things could be slower.

If I understand correctly, you're saying that we run the risk of making things slower for the common case? I think that is an acceptable trade-off. Ideally we gain some ground elsewhere, but that is out of scope. This is just us making sure we don't take down an Elasticsearch node.

crespocarlos commented 1 month ago

Perhaps the fact that synthtrace is currently generating unrealistic data (with spans having their parent spans as children) makes the effort of optimizing the service map harder than it should be. The fact that Sandy is getting a stack_overflow_error could be one symptom of that.

The non-unique transaction ids could also lead to non-deterministic data because, at any given time, a transaction from a service created by one scenario could become either a parent or a child span of a service created by another scenario.

For instance, service-8 can be created by both the other_bucket_group and services_map_oom scenarios. However, I believe neither of them is configured to create a service map like the one below:

image

In the worst-case scenario, this could lead to a cyclic dependency.

dgieselaar commented 1 month ago

Let's separate a few things:

crespocarlos commented 1 month ago

it should not under any circumstance cause an Elasticsearch node to go down

100% agree. Just thinking here that if my PR gets merged and the oblt clusters' data is erased, we might lose the circumstances that are causing the OOMs.

I'd also suggest refactoring the recursive function in the get_service_paths_from_trace_ids query and skipping processed ids to prevent cyclic referencing from causing infinite loops.
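Not the actual processAndReturnEvent implementation, just a Painless-style sketch of the kind of guard being suggested here (eventsById, eventId, and the parentId field are placeholders):

```
// Sketch only: follow parent links iteratively and skip ids that were already
// visited, so a cycle (A -> B -> A) terminates instead of looping forever.
def visited = new HashSet();
def current = eventsById.get(eventId);
while (current != null) {
  def id = current.get('id');
  if (visited.contains(id)) {
    break; // cyclic reference detected, stop following parents
  }
  visited.add(id);
  // ... add `current` to the service path being built ...
  current = eventsById.get(current.get('parentId'));
}
```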

dgieselaar commented 1 month ago

100% agree. Just thinking here that if my PR gets merged and the oblt clusters' data is erased, we might lose the circumstances that are causing the OOMs.

We shouldn't - if you run this script in isolation (so no chance of collision because of different synthtrace processes) it should successfully generate an OOM. But let's verify.

I'd also suggest refactoring the recursive function in the get_service_paths_from_trace_ids query and skipping processed ids to prevent cyclic referencing from causing infinite loops.

Ah, I don't think it's recursive. It's basically mapping over the services in reverse order (e.g. the deepest service first), and then creating a downstream span that has the upstream span as a child. I think... :)

dgieselaar commented 1 month ago

Also we should exclude the Service Map OOM scenario from the oblt clusters 😅

neptunian commented 1 month ago

The point of the Service Map OOM scenario is to create data that could cause an OOM. Its intent is not to generate data with cyclical dependencies, or realistic data. It simply tries to generate enough documents that are to be processed by the service map scripted metric agg to generate an OOM. You can verify if this causes an OOM by running a small Elasticsearch instance locally (run it with a very small amount of memory). If it doesn't do this, let's fix the scenario

Thanks @dgieselaar. Unfortunately @crespocarlos's changes didn't resolve it, but if I just remove the recursive call (processAndReturnEvent) inside the scripted metric agg, I do get the ES OOM locally. I'll fix this scenario so we have a reproducible ES OOM, but this gives me something to work with now.

For reference, this is the Kibana stack overflow error:

Error (after `ES_JAVA_OPTS='-Xms236m -Xmx236m' yarn es snapshot` and `node scripts/synthtrace.js service_map_oom --from=now-15m --to=now --clean`):

```
[2024-06-06T08:01:21.729-04:00][ERROR][plugins.apm] Error: search_phase_execution_exception
Caused by: script_exception: runtime error
    at unwrap_es_response.ts:45:13
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at fetchServicePathsFromTraceIds (fetch_service_paths_from_trace_ids.ts:225:48)
    at getServiceMapFromTraceIds (get_service_map_from_trace_ids.ts:58:48)
    at async Promise.all (index 1)
    at get_service_map.ts:76:30
    at async Promise.all (index 0)
    at get_service_map.ts:124:55
    at wrappedHandler (register_apm_server_routes.ts:112:35)
    at Router.handle (router.ts:268:30)
    at handler (router.ts:196:13)
    at exports.Manager.execute (/Users/sandy/dev/elastic/kibana/node_modules/@hapi/hapi/lib/toolkit.js:60:28)
    at Object.internals.handler (/Users/sandy/dev/elastic/kibana/node_modules/@hapi/hapi/lib/handler.js:46:20)
    at exports.execute (/Users/sandy/dev/elastic/kibana/node_modules/@hapi/hapi/lib/handler.js:31:20)
    at Request._lifecycle (/Users/sandy/dev/elastic/kibana/node_modules/@hapi/hapi/lib/request.js:371:32)
    at Request._execute (/Users/sandy/dev/elastic/kibana/node_modules/@hapi/hapi/lib/request.js:281:9)

Response: {
  error: {
    root_cause: [],
    type: 'search_phase_execution_exception',
    reason: '',
    phase: 'rank-feature',
    grouped: true,
    failed_shards: [],
    caused_by: {
      type: 'script_exception',
      reason: 'runtime error',
      script_stack: [
        'if (context.processedEvents[eventId] != null) {\n ',
        ' ^---- HERE'
      ],
      script: ' ...',
      lang: 'painless',
      position: { offset: 490, start: 479, end: 545 },
      caused_by: { type: 'stack_overflow_error', reason: null }
    }
  },
  status: 400
}
```
neptunian commented 1 month ago

fwiw, you can count the number of documents by executing a request that only counts the documents, but does not execute the aggregation. For the second request, the filter cache will kick in and it will run the aggregation on the collected documents. Unfortunately that is a hard stop, and you cannot partially process traces with that strategy. You could chunk up the documents differently until you get a number of documents that make sense, but it's probably more sensible to nudge the user to tweak these settings if possible (?).

@dgieselaar Similar to this idea, since the query returns the shards it ran on, I thought I could try getting the number of shards in one query (`"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}`) and then dividing the max total docs by that. @nik9000 thought this could be a viable approach. After looking at some debug logs I noticed that the reduce phase seemed to return more "shards" (from states) than what the query responds with. After talking with @nik9000, he looked into it and found that modern versions of Elasticsearch parallelize aggregation collection: aggregations get broken up across many threads to improve performance. So, for example, I have 1 shard but I get several datasets, all returning many docs: [ 52988, 53249, 53291, 146440, 53432 ]. @nik9000 opened a PR to turn this off for the scripted metric agg: https://github.com/elastic/elasticsearch/pull/109597.

I was able to disable parallelization on Elasticsearch using an undocumented internal setting and verified that the number of states was the same as the shard count. Now I get a single dataset, but the amount of docs can be quite high. So instead of [ 52988, 53249, 53291, 146440, 53432 ] I get [ 359000 ], which is almost exactly the same number of docs. This is the case even if I set terminate_after to something as small as 1000. terminate_after is "best effort": depending on various factors like cardinality and your aggregation, it won't necessarily adhere to anything close to that.

Instead of trying to depend on terminate_after, we can keep a counter in the map phase and not return docs after they've hit the max limit. This should ease the load on the coordinating node but the data nodes will need to map through however many docs. There doesn't seem to be a way around that. I think disabling parallelization should help here because we could get an accurate shard count and divide that by our max total doc count.
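A minimal sketch of that map-phase counter, reduced to just the counting logic; the max_docs_per_shard param and the reduce script are placeholders, not the actual service-map aggregation:

```
POST /traces-*/_search
{
  "size": 0,
  "aggs": {
    "service_map": {
      "scripted_metric": {
        "params": { "max_docs_per_shard": 10000 },
        "init_script": "state.docCount = 0; state.eventsById = new HashMap();",
        "map_script": """
          // Only collect until this shard hits its document budget; anything
          // beyond the cap is skipped instead of being shipped to the coordinator.
          if (state.docCount < params.max_docs_per_shard) {
            state.docCount += 1;
            // ... collect the fields needed to build service paths ...
          }
        """,
        "combine_script": "return state;",
        "reduce_script": """
          long total = 0;
          for (state in states) {
            total += state.docCount;
          }
          return total;
        """
      }
    }
  }
}
```

With parallel collection disabled, the per-shard budget could then be derived by dividing an overall document budget by the shard count.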

dgieselaar commented 1 month ago

Great finds! Very interesting stuff. Re: terminate_after, I think it's fine that for a small number it's inaccurate, but if terminate_after is 1000 and the aggregation still sees 360k documents, that's useless. What happens if you set terminate_after to e.g. 1m? I'm surprised, because we use terminate_after for search requests where we only care whether there's a single hit for a query, and I have never seen anything to suggest that it's actually collecting over 300k documents. If you experiment with increasing/decreasing the number, what kind of variance do you see?

We need to be careful with dividing the doc count by the number of shards, because data is not guaranteed to be evenly distributed, e.g. one shard might contain 99% of the data. As data streams are partitioned by service, this is not an unlikely scenario.

neptunian commented 1 month ago

I see no variation in terminate_after from the tests I've done, whether parallelization is on or off. It seems ignored. But if I set size: 1 it does respect terminate_after:

scripted metric agg count with size: 0, terminate_after: 100

```
POST /traces-*/_search
{
  "terminate_after": 100,
  "size": 0,
  "aggs": {
    "doc_count_metric": {
      "scripted_metric": {
        "init_script": "state.count = 0;",
        "map_script": """
          state.count += 1;
        """,
        "combine_script": "return state;",
        "reduce_script": """
          long totalDocCount = 0;
          for (state in states) {
            totalDocCount += state.count; // Sum the document counts from each shard
          }
          return totalDocCount;
        """
      }
    }
  }
}
```

Response:

```
{
  "took": 8,
  "timed_out": false,
  "terminated_early": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 10000, "relation": "gte" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "doc_count_metric": { "value": 359400 }
  }
}
```
scripted metric agg count with size: 1, terminate_after: 100

```
POST /traces-*/_search
{
  "terminate_after": 100,
  "size": 1,
  "aggs": {
    "doc_count_metric": {
      "scripted_metric": {
        "init_script": "state.count = 0;",
        "map_script": """
          state.count += 1;
        """,
        "combine_script": "return state;",
        "reduce_script": """
          long totalDocCount = 0;
          for (state in states) {
            totalDocCount += state.count; // Sum the document counts from each shard
          }
          return totalDocCount;
        """
      }
    }
  }
}
```

Response:

```
{
  "took": 1,
  "timed_out": false,
  "terminated_early": true,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 10000, "relation": "gte" },
    "max_score": 1,
    "hits": [
      {
        "_index": ".ds-traces-apm-default-2024.06.13-000001",
        "_id": "QqpVEpABgTLW8ZcmNPqx",
        "_score": 1,
        "_source": {
          "observer": { "type": "synthtrace", "version": "8.15.0-preview-1716438434", "version_major": 8 },
          "agent": { "name": "Synthtrace: service_map_oom" },
          "trace": { "id": "41835000000000000000000000708015" },
          "@timestamp": "2024-06-13T15:59:10.763Z",
          "service": { "node": { "name": "service-instance" }, "environment": "prod", "name": "service-0" },
          "host": { "name": "service-instance" },
          "event": { "outcome": "unknown" },
          "processor": { "name": "transaction", "event": "transaction" },
          "transaction": {
            "duration": { "us": 1000000 },
            "name": "GET /my/function",
            "id": "4183500000708014",
            "type": "request",
            "sampled": true
          },
          "timestamp": { "us": 1718294350763000 },
          "tags": [ "_geoip_database_unavailable_GeoLite2-City.mmdb" ]
        }
      }
    ]
  },
  "aggregations": {
    "doc_count_metric": { "value": 100 }
  }
}
```

So it seems terminate_after works as long as we have something in size (which we don't currently). Getting the existing terminate_after to kick in would be an improvement on its own.

We need to be careful with dividing no of shards by doc count, because data is not guaranteed to be unevenly distributed, e.g. one shard might contain 99% of the data. As data streams are partitioned by service this is not an unlikely scenario.

Yea, or the user could have partitioned their data streams via various namespaces. I think this is a tradeoff. I'm thinking that in a real-world scenario there probably won't be a huge number of matching docs from 50 traces (?), so it would be unlikely to go over the max per-shard limit unless there was a very large number of shards. It does seem like a safe way to avoid the possibility of too many docs being sent to the coordinating node, though.

sidenote: I said previously "I think disabling parallelization should help here because we could get an accurate shard count and divide that by our max total doc count." I don't think this is true, since whether it's on or off it returns the same docs, so I'm not sure it would matter if it was off. I spoke with @nik9000 about this and he'd rather have it off to be safe, even with some performance loss, so that PR was merged.

dgieselaar commented 1 month ago

So it seems terminate_after works as long as we have something in size (which we don't currently). Getting the existing terminate_after to kick in would be an improvement on its own.

TIL! I think setting size:1 invalidates the shard request cache but that probably isn't very useful anyway.

Yea, or the user could have partitioned their data streams via various namespaces. I think this is a tradeoff. I'm thinking that in a real-world scenario there probably won't be a huge number of matching docs from 50 traces (?), so it would be unlikely to go over the max per-shard limit unless there was a very large number of shards. It does seem like a safe way to avoid the possibility of too many docs being sent to the coordinating node, though.

Oh, I'm not sure what you mean then. Can you give some examples with numbers?

neptunian commented 1 month ago

I opened this PR a couple days ago https://github.com/elastic/kibana/pull/186417. I added some testing directions to test locally with smaller amounts of data. To recap:

This strategy involves dividing by the number of shards to limit the total documents that can be processed (for example, a total budget of 1,000,000 documents spread across 20 shards would cap each shard at 50,000 documents). Depending on how evenly the user's data is distributed, this could mean we are not able to match as many docs if some users have mostly empty shards or unevenly distributed data, which means a potentially less accurate service map. This is a tradeoff.

I wanted to go over some issues still to be resolved / noted:

dgieselaar commented 1 month ago

@neptunian what is the point of setting a max trace depth?

neptunian commented 1 month ago

@neptunian what is the point of setting a max trace depth?

@dgieselaar I don't think it would be as much of an issue now, since we are limiting the docs. But I was getting stack overflow errors and crashing Kibana using synthtrace data if I didn't add some limitation. I don't know if this is the right way to go about handling that scenario, though. I've asked @crespocarlos to look into it and he said there might be some better optimizations.
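For illustration only, a Painless-style sketch of what such a depth cap could look like; max_trace_depth, eventsById, and the field names are placeholders rather than the actual implementation:

```
// Sketch only: stop following parent links after a fixed depth so a
// pathologically long (or corrupted) trace cannot blow the stack or the heap.
int depth = 0;
def current = eventsById.get(eventId);
while (current != null && depth < params.max_trace_depth) {
  // ... process `current` ...
  current = eventsById.get(current.get('parentId'));
  depth++;
}
```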

paulb-elastic commented 3 weeks ago

Closing - https://github.com/elastic/kibana/pull/186417 merged and https://github.com/elastic/kibana/issues/187707 opened