cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com

Better observability/control for underlying network issues #63315

Open alistair-roacher opened 3 years ago

alistair-roacher commented 3 years ago

Is your feature request related to a problem? Please describe. Running a cluster across multiple locations, with 2 nodes in each location; typical latency between locations is between 70 and 140ms. There are frequent networking issues such as increased latency, asymmetric latency between 2 locations, and dropped packets. We are observing increased SQL execution times for short periods (typically up to 20min at a time), and we would like to be able to more easily correlate these periods of increased execution time with periods of network instability.

Describe the solution you'd like We would like statistics relating to network latency and failures of pings between nodes to be exposed on the Prometheus endpoint. The DB Console already exposes a network graph that captures the network latency between nodes; we would like that data to be exposed in Prometheus.

Additionally, any time a node fails to contact another node, we would like that to be exposed as a metric in Prometheus.
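To make the request concrete, here is a minimal sketch of what the requested series could look like in the Prometheus text exposition format. The metric names `node_rtt_seconds` and `node_ping_failures_total`, the `source`/`target` labels, and the `render_metrics` helper are all hypothetical, chosen for illustration; they are not existing CockroachDB metrics.

```python
def render_metrics(latency_seconds, ping_failures):
    """Render inter-node metrics in the Prometheus text exposition format.

    latency_seconds: {(source, target): float}  last observed round-trip time
    ping_failures:   {(source, target): int}    cumulative failed pings
    Metric and label names below are hypothetical, for illustration only.
    """
    lines = [
        "# HELP node_rtt_seconds Last observed round-trip latency between two nodes.",
        "# TYPE node_rtt_seconds gauge",
    ]
    for (src, dst), seconds in sorted(latency_seconds.items()):
        lines.append(f'node_rtt_seconds{{source="{src}",target="{dst}"}} {seconds}')

    lines += [
        "# HELP node_ping_failures_total Failed pings from one node to another.",
        "# TYPE node_ping_failures_total counter",
    ]
    for (src, dst), count in sorted(ping_failures.items()):
        lines.append(
            f'node_ping_failures_total{{source="{src}",target="{dst}"}} {count}'
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # One directed pair per series, so asymmetric latency shows up as
    # different values for (1, 2) and (2, 1).
    print(render_metrics({(1, 2): 0.07, (2, 1): 0.14}, {(2, 1): 3}))
```

Keeping one series per directed node pair is what would let asymmetric latency and one-way connectivity failures be graphed and alerted on separately.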

Describe alternatives you've considered We have managed to extract the network latency statistics from the DB Console API and feed these into Prometheus, but this is a clunky process and we are not clear on exactly what period the latency stats cover. This does not help us to know when nodes/locations are completely unable to contact each other due to underlying network issues.
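For reference, the workaround looks roughly like the sketch below. It assumes the latency data has already been fetched and reshaped into a nested mapping of source node to target node to latency in nanoseconds, with `None` where no latency was reported; that shape and the `flatten_latencies` helper are assumptions for illustration, not the actual DB Console API response format.

```python
import time


def flatten_latencies(payload, now=None):
    """Flatten {source: {target: latency_ns or None}} into samples.

    Returns (samples, unreachable), where samples are
    (source, target, latency_seconds, collected_at) tuples and unreachable
    lists the (source, target) pairs that had no latency reported.
    The input shape is an assumption, not the real API response.
    """
    now = time.time() if now is None else now
    samples, unreachable = [], []
    for src, targets in payload.items():
        for dst, latency_ns in targets.items():
            if src == dst:
                continue  # skip a node's entry for itself
            if latency_ns is None:
                unreachable.append((src, dst))
            else:
                samples.append((src, dst, latency_ns / 1e9, now))
    return samples, unreachable


if __name__ == "__main__":
    payload = {"1": {"1": 0, "2": 70_000_000, "3": None}}
    print(flatten_latencies(payload, now=100.0))
```

The `collected_at` value only records when the script polled, not the window the latency figure covers, and a missing entry does not reliably distinguish an unreachable node from one that simply has not reported yet. Those are the two gaps described above.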


Epic: CRDB-8500

Jira issue: CRDB-6522

knz commented 3 years ago

@piyush-singh can you assess how this fits the roadmap? Thanks.

knz commented 3 years ago

Related #63639

github-actions[bot] commented 11 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!