Open alistair-roacher opened 3 years ago
@piyush-singh, can you assess how this fits the roadmap? Thanks.
Related #63639
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
Is your feature request related to a problem? Please describe. We run a cluster across multiple locations, with 2 nodes in each location. Typical latency between locations is between 70 and 140 ms. There are frequent networking issues such as increased latency, asymmetric latency between 2 locations, and dropped packets. We are observing increased SQL execution times for short periods (typically up to 20 min at a time), and we would like to be able to more easily correlate these periods of increased execution time with periods of network instability.
Describe the solution you'd like We would like statistics on network latency and on failed pings between nodes to be exposed on the Prometheus endpoint. The DB Console already exposes a network graph that captures the network latency between nodes; we would like that data to be exposed in Prometheus.
Additionally, any time a node fails to contact another node, we would like that to be exposed as a metric in Prometheus.
Describe alternatives you've considered We have managed to extract the network latency statistics from the DB Console API and feed these into Prometheus, but this is a clunky process and we are not clear on exactly what period the latency stats cover. This does not help us to know when nodes/locations are completely unable to contact each other due to underlying network issues.
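For illustration, the kind of output we are after could look roughly like the sketch below. It only shows the rendering step: it takes per-node-pair latencies that have already been collected and formats them in the Prometheus text exposition format. The metric names (`crdb_internode_latency_seconds`, `crdb_internode_unreachable`), the label names, and the input shape are all hypothetical and chosen for this example; they are not existing CockroachDB metrics, and the step that fetches the data from the DB Console API is deliberately left out.

```python
from typing import Dict, Optional, Tuple

# Hypothetical metric names, chosen for this sketch only.
LATENCY_METRIC = "crdb_internode_latency_seconds"
UNREACHABLE_METRIC = "crdb_internode_unreachable"


def render_metrics(latencies: Dict[Tuple[int, int], Optional[float]]) -> str:
    """Render node-pair latencies in the Prometheus text exposition format.

    `latencies` maps (source_node_id, target_node_id) to a round-trip
    latency in seconds, or to None when the source could not reach the
    target. This input shape is an assumption, not a CockroachDB API.
    """
    pairs = sorted(latencies.items())

    lines = [
        f"# HELP {LATENCY_METRIC} Last observed round-trip latency between two nodes.",
        f"# TYPE {LATENCY_METRIC} gauge",
    ]
    for (src, dst), rtt in pairs:
        # Emit no latency sample for unreachable pairs rather than a fake value.
        if rtt is not None:
            lines.append(f'{LATENCY_METRIC}{{source="{src}",target="{dst}"}} {rtt}')

    lines += [
        f"# HELP {UNREACHABLE_METRIC} 1 if the source node could not contact the target node.",
        f"# TYPE {UNREACHABLE_METRIC} gauge",
    ]
    for (src, dst), rtt in pairs:
        flag = 1 if rtt is None else 0
        lines.append(f'{UNREACHABLE_METRIC}{{source="{src}",target="{dst}"}} {flag}')

    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Node 1 reaches node 2 in 70 ms; node 2 cannot reach node 1 (asymmetric failure).
    print(render_metrics({(1, 2): 0.07, (2, 1): None}), end="")
```

Keeping the latency and the reachability flag as separate gauges means an asymmetric failure (node 2 cannot reach node 1 while the reverse direction works) shows up directly in a query, instead of having to be inferred from a missing latency sample.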
Epic: CRDB-8500
Jira issue: CRDB-6522