
rpc: export metrics for latencies between peers #71193

Open tbg opened 2 years ago

tbg commented 2 years ago

Is your feature request related to a problem? Please describe.

Our metrics currently give us no insight into the latency between two nodes n1 and n2; all we have is a histogram which receives the measured latencies between all nodes, i.e. a large grab bag. The Node Latency Matrix does give us per-peer information, but nothing historical, and besides it's pretty buggy (https://github.com/cockroachdb/cockroach/issues/58033) and a poor match for alertability. In particular, it will show the most recent successful measurements, i.e. will show "good" results when network links degrade to the point where probes fail.

Describe the solution you'd like

At the very least we should fix #58033, but beyond that there is a strong case that these pair-wise latencies should be recorded to timeseries. Since our internal timeseries does not support labels (or dynamic timeseries names of any kind) and there is no desire to add this, this would be added for prometheus only.

To get this, we will likely need to spend a decent amount of time wrangling with our metrics package, which does not treat prometheus-go's data structures as first-class citizens and is thus fairly resistant to adopting prometheus histograms with labels, as we would want to here. Nevertheless, this can be done.
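For concreteness, here is a minimal sketch of what a labelled per-peer latency histogram looks like when using the prometheus client library directly. The package, metric name, label, and buckets are placeholders, not a proposal for how the metrics-package integration would actually look:

    package rpcmetrics // illustrative package name

    import "github.com/prometheus/client_golang/prometheus"

    // peerRoundTripLatency is a per-peer latency histogram keyed by the
    // destination node ID. Name, label, and buckets are placeholders.
    var peerRoundTripLatency = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "rpc_peer_round_trip_latency_seconds",
            Help:    "Round-trip latency of RPC heartbeats, by remote peer.",
            Buckets: prometheus.ExponentialBuckets(1e-4, 2, 20), // ~100µs up to ~50s
        },
        []string{"remote_node_id"},
    )

    func init() {
        prometheus.MustRegister(peerRoundTripLatency)
    }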

Then we can use the rpc heartbeats as a reasonable proxy for network latency, making sure to record something sensible when there are network issues (for example, a second histogram to which we record on heartbeat errors, and which could be used to reason about network partitions after the fact).
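Continuing the illustrative sketch above, the recording hook in the heartbeat code could look roughly like this; recordHeartbeat, peerHeartbeatFailures, and the metric names are all hypothetical:

    // peerHeartbeatFailures would be defined and registered analogously to
    // peerRoundTripLatency in the sketch above.
    var peerHeartbeatFailures = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{Name: "rpc_heartbeat_failure_latency_seconds"},
        []string{"remote_node_id"},
    )

    // recordHeartbeat would be called once per heartbeat attempt. On success we
    // observe the measured round-trip time; on error we observe into a second
    // histogram so that degraded or partitioned links remain visible after the fact.
    func recordHeartbeat(remoteNodeID string, elapsed time.Duration, err error) {
        if err != nil {
            // The time elapsed until the failure (often the heartbeat timeout)
            // is still useful when reasoning about a partition later.
            peerHeartbeatFailures.WithLabelValues(remoteNodeID).Observe(elapsed.Seconds())
            return
        }
        peerRoundTripLatency.WithLabelValues(remoteNodeID).Observe(elapsed.Seconds())
    }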

Describe alternatives you've considered

We could pretend that it's the operator's task to observe their networks. However, in practice, for on-prem deployments this is a naive approach.

We could also go further and record network latencies not only from our heartbeats, but for all RPCs (or at least important/frequent ones). This would give us more data to look at. It's important here that we distinguish network latency from service latency. We care about network latency here, i.e. we'd need to do something like the following: before sending an outgoing request, remember a start timestamp; at the receiver, measure the duration the request spends being handled at the server and send that back to the sender, who then computes end - start - duration to extract a close approximation of the network latency (doing it this way avoids mixing timestamps taken from different wall clocks).

I think on the sending side this could be done in a UnaryClientInterceptor:

    // Sketch: a grpc.UnaryClientInterceptor that records an approximation of the
    // network latency (round-trip time minus the server-side handling time).
    func networkLatencyInterceptor(
        ctx context.Context, method string, req, resp interface{},
        cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption,
    ) error {
        networkLatency := time.Duration(math.MaxInt64) // sentinel for "unknown"
        tBegin := timeutil.Now()
        err := invoker(ctx, method, req, resp, cc, opts...)
        if g, ok := resp.(interface {
            ServerSideDuration() time.Duration
        }); ok {
            networkLatency = timeutil.Since(tBegin) - g.ServerSideDuration()
        }
        // Problem: want to record with a nodeID label for the destination, but we
        // have no clue who that is here. There are probably ways to annotate the
        // ClientConn with this. And if all else fails, at least we have the address.
        someHistogram.Record(networkLatency, someLabel)
        return err
    }

and something similar could be done on the server with a UnaryServerInterceptor that populates ServerSideDuration(). All we need for this to work is for the proto to implement the respective interface, which we could enforce.
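For completeness, a minimal sketch of that server side, assuming a hypothetical SetServerSideDuration setter on the response proto (the interceptor and setter names here are made up):

    import (
        "context"
        "time"

        "github.com/cockroachdb/cockroach/pkg/util/timeutil"
        "google.golang.org/grpc"
    )

    // serverSideDurationInterceptor measures how long the request spends being
    // handled at the server and stashes that duration on the response, so the
    // client can subtract it from its end-to-end measurement.
    func serverSideDurationInterceptor(
        ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler,
    ) (interface{}, error) {
        tBegin := timeutil.Now()
        resp, err := handler(ctx, req)
        if s, ok := resp.(interface {
            SetServerSideDuration(time.Duration)
        }); ok && err == nil {
            s.SetServerSideDuration(timeutil.Since(tBegin))
        }
        return resp, err
    }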

For streaming RPCs this would be harder but maybe we don't need that.

Additional context

I'm adding this to Obs Inf's project since it is "easy" to add once the infrastructure is in place.

Jira issue: CRDB-10430

gz#13827

Epic CRDB-32137

tbg commented 2 years ago

@joshimhoff I wonder how big of a priority this is for CC. Do you folks have good monitoring of network links set up anyway and this wouldn't add anything? An example to consider is a recent occurrence (involving @sumeerbhola) that we had on dedicated where a single network link would incur 5-10% packet loss intermittently (every couple of minutes). Something like this can cause trouble, but it doesn't necessarily result in "obvious errors" in the logs. How would something like that play out in CC?

joshimhoff commented 2 years ago

An example to consider is a recent occurrence (involving @sumeerbhola) that we had on dedicated where a single network link would incur 5-10% packet loss intermittently (every couple of minutes).

This happened in CC dedicated? Can I get a link with details if so? I want to know why alerting didn't fire.

How would something like that play out in CC?

My hope is we'd see some indication of the trouble via our monitoring stack, esp. kvprober that writes. If you think we wouldn't, may I suggest we prioritize https://github.com/cockroachdb/cockroach/issues/71169 haha? I don't want to alert on the cause of high packet loss; that is the way to ops death, as alerting on all the possible causes leads to too complex of an alerting config and lots of false positives in my experience.

I wonder how big of a priority this is for CC.

Despite what I say above, more time-series metrics are always good, as monitoring != alerting.

With that said, IMHO, in a CC context, it would be better to measure network latency from some daemon that runs on all k8s nodes rather than from CRDB. Measuring from CRDB introduces correlations between CRDB production issues and the network latency measurements, and those correlations confuse the debugging of production outages by making it hard to figure out which is the root cause and which is its effect!

sumeerbhola commented 2 years ago

it would be better to measure network latency from some daemon that runs on all k8s nodes, rather than from CRDB

my 2 cents: this should ideally be an and. Measuring from outside is good to have, but we need to see what the CockroachDB nodes are observing as network latency (or having RPCs queued up and never sent until the deadline expires, because bandwidth is poor -- which this issue doesn't talk about, but ought to), since that is what affects the cluster behavior (perhaps the daemon uses a slightly different configuration that gives it a different network path or QoS, or there is a problem specifically with the CockroachDB pod).

The problem with the current situation is that, since we don't know what is being observed by the CockroachDB RPCs, we (a) first try to eliminate all other problems that could be happening in a CockroachDB cluster (which is an extremely long list), and (b) try to hypothesize whether other external measurements of packet loss or bandwidth show significant enough degradation to be of concern. It would be obvious that they were of concern if those external measurements correlated with RPCs sitting in the outgoing queue for (say) 500ms and exceeding their deadline, or with successful RPCs showing network latency significantly higher than normal.

joshimhoff commented 2 years ago

my 2 cents: this should ideally be an and. Measuring from outside is good to have, but we need to see what the CockroachDB nodes are observing as network latency

Ya +1. I agree with this totally despite what I said.

@sumeerbhola, did the issue Tobias mentioned happen in CC dedicated? Can I get a link with details if so? I want to know why alerting didn't fire.

tbg commented 2 years ago

No, it was on-prem and the problem was that we were flying completely blind as the customer was unable to pinpoint the problem in their networking using their (likely insufficient) monitoring infra.

(or having RPCs queued up and never sent until the deadline expires, because bandwidth is poor -- which this issue doesn't talk about, but ought to)

I made an edit to that effect.

tbg commented 1 year ago

https://github.com/cockroachdb/cockroach/pull/99191 merged a few weeks ago. It does add per-peer metrics; however, they are gated behind the server.child_metrics.enabled cluster setting (and are only exposed on the prom endpoint, not in internal timeseries). As such, they're not going to be available when collecting from on-prem customers (only their sum is available, but that sum makes little sense), but we should roll out this setting on Cloud (filed https://cockroachlabs.atlassian.net/browse/CC-24928 to discuss this). For example, here is what the latency metrics look like:

    $ roachprod create -n 3 local
    $ roachprod stage local cockroach
    $ roachprod start local
    $ roachprod sql local:1 -- -e 'set cluster setting server.child_metrics.enabled=true;'
    $ curl -s 'http://localhost:26258/_status/vars' | grep -E '^rpc_connection_avg.*remote_node_id="[^0]'
    rpc_connection_avg_round_trip_latency{node_id="1",remote_node_id="1",remote_addr="crlMBP-VHN267714PMTY2.local:26257",class="default"} 344957
    rpc_connection_avg_round_trip_latency{node_id="1",remote_node_id="2",remote_addr="crlMBP-VHN267714PMTY2.local:26267",class="default"} 970019
    rpc_connection_avg_round_trip_latency{node_id="1",remote_node_id="2",remote_addr="crlMBP-VHN267714PMTY2.local:26267",class="system"} 915820
    rpc_connection_avg_round_trip_latency{node_id="1",remote_node_id="3",remote_addr="crlMBP-VHN267714PMTY2.local:26269",class="default"} 1.002118e+06
    rpc_connection_avg_round_trip_latency{node_id="1",remote_node_id="3",remote_addr="crlMBP-VHN267714PMTY2.local:26269",class="system"} 945153

It would be a worthwhile conversation to have (perhaps with obs-inf) on whether there is something tactical we can do to benefit from this work to support on-prem customers (which are also the ones whose networks act up the most, something we spend excessive amounts of time on when it happens, since self-diagnosis is often weak). cc @nkodali

data-matt commented 11 months ago

@tbg our large on-premise customers should be monitoring their prometheus metrics. So this should be at least a step in the right direction.

Unfortunate that we won't be able to tsdump them.

If I'm reading this right, when server.child_metrics.enabled is true, this exposes these metrics on the prometheus endpoint?

tbg commented 11 months ago

If I'm reading this right, when server.child_metrics.enabled is true, this exposes these metrics on the prometheus endpoint?

That's correct! I think it would make sense to expose these metrics by default, but was initially following the existing pattern for labelled metrics. The obs-inf or cluster-obs team could speak to whether we can do this (and could own doing so).