Better tooling/logs for troubleshooting long running CCS requests

ppf2 commented 3 years ago

Troubleshooting long-running cross-cluster search (CCS) requests is challenging, especially in large environments with many downstream clusters distributed across different regions.

To isolate the problem, users have to extract the search request and run it against each downstream cluster separately, and also against all clusters together, then compare the timings to see whether one or more clusters are slow. They also have to check monitoring stats or collect diagnostics for each downstream cluster while the query is executing to see whether there is a bottleneck. In the async search case, they have to test and compare against a regular search while toggling the minimize roundtrips setting (ccs_minimize_roundtrips) on and off to see whether network latency could be the issue.
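The manual comparison described above can be scripted. Below is a minimal Python sketch, not an official tool: the names `SearchFn`, `http_search`, and `time_clusters` are made up for illustration, the cluster aliases and base URL are placeholders, and authentication/TLS handling is omitted. It runs the same body against each `alias:index` expression and then against all of them together, recording client-side wall-clock time next to the `took` value from the search response.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, List

# Hypothetical type: takes an index expression and a query body,
# returns the parsed search response as a dict.
SearchFn = Callable[[str, dict], dict]


def http_search(base_url: str) -> SearchFn:
    """Build a SearchFn that POSTs to <base_url>/<index expression>/_search.

    Assumes an unauthenticated HTTP endpoint; add headers as needed.
    """
    def run(index_expr: str, body: dict) -> dict:
        req = urllib.request.Request(
            f"{base_url}/{index_expr}/_search",
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return run


def time_clusters(search: SearchFn, aliases: List[str],
                  index: str, body: dict) -> Dict[str, dict]:
    """Time the same search per remote cluster, then across all of them."""
    # One "alias:index" expression per downstream cluster.
    targets = {alias: f"{alias}:{index}" for alias in aliases}
    # Plus one combined expression covering every cluster.
    targets["<all>"] = ",".join(targets.values())

    results: Dict[str, dict] = {}
    for label, expr in targets.items():
        start = time.perf_counter()
        response = search(expr, body)
        wall_ms = (time.perf_counter() - start) * 1000.0
        took_ms = response.get("took", 0)  # reported by the CCS node
        results[label] = {
            "index_expr": expr,
            "wall_ms": wall_ms,
            "took_ms": took_ms,
            # Rough client-side overhead (network, serialization).
            "overhead_ms": wall_ms - took_ms,
        }
    return results
```

Usage would look like `time_clusters(http_search("http://localhost:9200"), ["cluster_one", "cluster_two"], "logs-*", {"query": {"match_all": {}}})`. This only shows which cluster is slow from the client's point of view; it cannot separate network latency from processing time on the remote side, which is the gap this issue is about.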

The profile API has limitations: it doesn't include things like network latency, the time taken to send responses back, or the time taken to reduce the results on the coordinating node. We have transport tracers, but their disclaimer seems to suggest that they are not appropriate for live production debugging. Also, the tracer output doesn't include the search request details, which makes it difficult to isolate and trace the performance of a specific query.

It would be helpful to have an API or debug loggers that provide details to help users determine:

- How long did it take for the request to be sent from the CCS node to each downstream cluster?
- How long did it take for each downstream cluster to process its request?
- How long did it take for each downstream cluster to send its response back to the CCS node?
- How long did it take for the CCS node to process the results collected from all clusters?
- How long did it take for the CCS node to send its response back to the originating client?
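To make the requested breakdown concrete, here is a purely hypothetical Python sketch of what a per-cluster timing record could look like. Nothing here is an existing Elasticsearch API: `ClusterTiming`, its fields, and `slowest` are invented for illustration. It assumes each node only measures durations on its own clock, so the send and return legs are reported as one combined network figure rather than separately (splitting them would require synchronized clocks across clusters).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ClusterTiming:
    """Hypothetical per-cluster timings for one CCS request."""
    # Measured on the CCS node: request sent -> response received.
    round_trip_ms: float
    # Measured on the downstream cluster: its own processing time.
    remote_took_ms: float

    @property
    def network_ms(self) -> float:
        # Combined send + return time; clamped at zero in case the
        # two measurements disagree slightly.
        return max(self.round_trip_ms - self.remote_took_ms, 0.0)


def slowest(timings: Dict[str, ClusterTiming]) -> str:
    """Return the alias of the cluster with the largest round trip."""
    return max(timings, key=lambda alias: timings[alias].round_trip_ms)
```

For example, a record with `round_trip_ms=120.0` and `remote_took_ms=80.0` would attribute 40 ms to the network, which is the kind of signal that currently has to be inferred by hand.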

elasticmachine commented 3 years ago

Pinging @elastic/es-search (Team:Search)

javanna commented 2 years ago

This one is linked with #21073 and #84369. Once we are able to break down the time spent processing a request into its individual sub-tasks, we should think about how to connect tasks that run on separate clusters but are part of the same search execution. The same could be done for field_caps etc.

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)

elastic / elasticsearch

Better tooling/logs for troubleshooting long running CCS requests #73922