Following up from https://github.com/elastic/kibana/pull/186417
When testing the service map under maximum conditions (1k trace IDs, each trace containing ~500 spans), the scripted metric aggregation can cause an OOM in Elasticsearch, depending on the memory available. Looking at the Elasticsearch heap dump, I suspect this is due to the number of hash maps and other data structures being created simultaneously during the reduce phase, where data can be duplicated and held in memory at the same time. The issue did not occur when the parallel async requests in fetch_service_paths_from_trace_ids were disabled and run sequentially. Further investigation is needed.
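For reference, the workaround described above amounts to serializing the per-chunk searches rather than firing them all at once. A minimal sketch of that pattern, where `fetchPathsForChunk` is a hypothetical stand-in for the per-chunk search performed inside fetch_service_paths_from_trace_ids:

```typescript
// Instead of issuing the scripted-metric-aggregation searches for all
// trace-id chunks in parallel (e.g. via Promise.all), run them one at a
// time so that only one reduce phase holds its intermediate hash maps
// in Elasticsearch memory at any given moment.
async function fetchSequentially<T, R>(
  chunks: T[],
  fetchPathsForChunk: (chunk: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  for (const chunk of chunks) {
    // Awaiting inside the loop serializes the requests, trading total
    // latency for a bounded peak heap on the Elasticsearch side.
    results.push(await fetchPathsForChunk(chunk));
  }
  return results;
}
```

This trades response time for memory headroom, which may be acceptable only as a stopgap while the duplication in the reduce phase is investigated.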