Open ppf2 opened 3 years ago
Pinging @elastic/es-search (Team:Search)
This one is linked with #21073 and #84369. Once we are able to break down the time spent processing a request into its individual sub-tasks, we should think about how to connect tasks that run on separate clusters but are part of the same search execution. The same could be done for field_caps etc.
Pinging @elastic/es-search-foundations (Team:Search Foundations)
Troubleshooting long-running cross-cluster search requests is challenging, especially in large environments with many downstream clusters distributed across different regions.
To isolate the problem, users have to extract the search request and run it against every downstream cluster separately, and also against all clusters together, then compare the timings to see whether one or more clusters are slow. They also have to check monitoring stats or collect diagnostics for each downstream cluster while the query is executing to see if there is a bottleneck. In the async search case, they have to test and compare using a regular search while toggling the minimize roundtrips setting (on/off) for CCS to see whether network latency could be the issue.
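For illustration, the manual comparison described above could be scripted roughly like this. This is only a sketch: `time_per_cluster` and the `run_search` callable are hypothetical names, and `run_search` is assumed to send the extracted request body to `_search` for the given index expression and return the parsed JSON response. The `cluster_alias:index` syntax and the `took` field in the response are the only Elasticsearch specifics it relies on.

```python
import time
from typing import Callable, Dict, List, Optional


def time_per_cluster(
    run_search: Callable[[str], dict],
    clusters: List[str],
    index: str,
) -> Dict[str, Dict[str, Optional[float]]]:
    """Run the same search against each remote cluster alone, then all together.

    run_search is a hypothetical callable supplied by the caller: it takes an
    index expression such as "eu:logs-*" and returns the search response as a dict.
    """
    # One target per remote cluster, using the CCS "alias:index" syntax.
    targets = {alias: f"{alias}:{index}" for alias in clusters}
    # Plus one target that spans every cluster, for comparison.
    targets["<all>"] = ",".join(f"{alias}:{index}" for alias in clusters)

    results: Dict[str, Dict[str, Optional[float]]] = {}
    for label, expression in targets.items():
        start = time.perf_counter()
        response = run_search(expression)
        wall_ms = (time.perf_counter() - start) * 1000.0
        results[label] = {
            # Elapsed time as seen by the client.
            "wall_ms": wall_ms,
            # "took" as reported in the search response, in milliseconds.
            "took_ms": response.get("took"),
        }
    return results
```

A cluster whose `took_ms` stands out against the others is a candidate for closer inspection. This still does not separate network latency from remote execution time, which is the gap this issue is about.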
The profile API has limitations: it doesn't include things like network latency, send-back times, or the time to reduce the results on the coordinating nodes. We have transport tracers, but the disclaimer seems to suggest that they are not appropriate for live production debugging. Also, the tracer output doesn't include the search request details, which makes it difficult to isolate and trace the performance of a specific query.
It would be helpful to have an API or debug loggers that provide details that will help users determine