cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Improve network detection diagnostics to prove network partitions exist between nodes #95378

Open thtruo opened 1 year ago

thtruo commented 1 year ago

Is your feature request related to a problem? Please describe. When a cluster is behaving badly due to networking issues, network partitions, or asymmetric (one-way) network partitions, those problems are currently hard to diagnose and prove while the cluster is degraded or has an outage. If a cluster is down because of a network issue, you cannot use standard CRDB diagnostics because, by definition, the cluster is already down. We need a way to confirm whether and exactly where the network issues are, external to anything that relies on the cluster being up and running. The lack of this type of diagnostics has led to very long, drawn-out troubleshooting scenarios and escalations between CRL and customers.

Describe the solution you'd like Ultimately, the diagnostic tooling should prove, or at least show evidence, that there are in fact networking issues between nodes. As suggested by @smcvey, ideally the diagnostics would check the RPC and SQL ports from node A to node B and vice versa (see the sketch below). The tooling should sit outside of CRDB, perhaps as a CLI command or a separate external network detection tool, so that it remains usable even when the cluster is down.
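A minimal sketch of what such a bidirectional check might look like, using plain nc; the hostname and ports below are placeholders, and the SQL port only differs from the RPC port if the node was started with a separate --sql-addr:

# Run from node A against node B's advertised addresses, then repeat from
# node B toward node A to catch asymmetric (one-way) partitions.
nc -vz -w 5 nodeB 26257   # RPC/gRPC port
nc -vz -w 5 nodeB 26258   # SQL port, if --sql-addr is split out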

Additional context This was inspired by an internal conversation (Slack link that is only accessible to CRL employees) cc @mwang1026

Jira issue: CRDB-23492

Epic CRDB-32137

tbg commented 1 year ago

Preamble: what we call "network connectivity issues" are often not true transport problems but are instead caused by incorrectly configured DNS resolution and the like, and we need to be careful that our connectivity tool captures those as well. Connection problems can also arise from exceeded connection timeouts, either on overloaded clusters (the TLS handshake does use some CPU) or on slow links (large RTT); you can usually tell these apart from the log messages. What's trickier are lossy connections, where the connection sometimes drops but is often "there" while performing very poorly.
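For instance, a first pass at separating a DNS misconfiguration from a transport-level problem can be as simple as resolving the advertised hostname and testing the TCP connect independently (the hostname and port below are placeholders):

# Resolve the advertised hostname on its own, so a DNS problem
# isn't mistaken for a network partition.
dig +short crdb-node-2.internal          # or: getent hosts crdb-node-2.internal
# Then test the raw TCP connect with a short timeout.
nc -vz -w 5 crdb-node-2.internal 26257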

As of #99191, CRDB exports metrics about the number of unhealthy connections. (Of course, if network connectivity prevents these timeseries from being written, you won't be able to see them.) It also logs, as it has done before, but in a much cleaner format: failing connections are periodically reattempted, and an error is logged once per minute for each. My intuition (grounded in some reality) is that these messages are quite accurate; these are just vanilla gRPC TLS connections! If one of them fails, it's unlikely to be CockroachDB's fault, and the messages usually let you conclude whether the problem is at the TCP/IP level or below; if CRDB can't connect, I cannot think of an example where this was really a CRDB problem. cc @koorosh, who is working on exposing connection problems in the UI[^1] - we should also consider exposing the connection error (as a tooltip or something like that) when the connection is in a failed state.
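For reference, those metrics can also be scraped directly from the Prometheus endpoint even when the DB Console timeseries are unavailable; the HTTP port and the exact metric names below are assumptions (they may differ across versions), so the grep is intentionally loose:

# _status/vars is the Prometheus-format metrics endpoint (HTTP port, 8080 by default).
# Use https and credentials on a secure cluster. Grep loosely for the per-peer
# RPC connection health metrics added around #99191; exact names may vary.
curl -s 'http://localhost:8080/_status/vars' | grep -i 'rpc_connection'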

And yet, we do periodically see customers with "network issues" who claim to have run telnet and succeeded, but where the issue is ultimately resolved after "the network folks fixed the firewall". The only explanation I have for this is that the "run telnet" step wasn't actually done properly; any instance where it was run properly and still succeeded despite an actual hard connectivity issue would be an important data point (cc @smcvey @irfansharif, since you were both reporting relevant things in the Slack thread).

So I think "we" (Support) can tell that there are network connectivity issues by looking at the logs. (And a separate CLI command doesn't remove the need to do that first.)

My intuition is that the thing that really needs to be done here is to make it easier for users to convince themselves that TCP/DNS isn't working correctly. If the telnet checks fail, the problem is more involved.

For example, if we had an "easy" way to get the list of all advertised addresses across the cluster in host:port format in a text file, the user could run the following on each CRDB node:

#!/usr/bin/env bash
# Reads host:port pairs from stdin (one per line) and checks plain TCP
# reachability for each with nc, printing nc's output on failure.
out=$(mktemp)
for f in $(cat /dev/stdin); do
    echo -n "$f"
    # Split "host:port" into "host port"; word splitting is intentional here.
    # -z: just scan, don't send data; -w 5: five-second timeout.
    if nc -v -w 5 -z $(echo "$f" | sed 's/:/ /') &> "$out"; then
        echo " ok"
    else
        echo " !!!!!!!!!!!!!!!!!!!!"
        cat "$out"
    fi
done

A toy example is here:

$ echo 'localhost:26256
localhost:26257
localhost:26258
localhost:26259' | ./scan.sh
localhost:26256 !!!!!!!!!!!!!!!!!!!!
nc: connectx to localhost port 26256 (tcp) failed: Connection refused
nc: connectx to localhost port 26256 (tcp) failed: Connection refused
localhost:26257 ok
localhost:26258 ok
localhost:26259 !!!!!!!!!!!!!!!!!!!!
nc: connectx to localhost port 26259 (tcp) failed: Connection refused
nc: connectx to localhost port 26259 (tcp) failed: Connection refused

I didn't find a super obvious way to pull all of the addresses, especially since you also want it to work if the cluster is pretty hosed, but something like this might work (though it needs more jq love to also pick up the other advertised addresses, such as the SQL and per-locality ones):

$ curl -s 'http://localhost:26258/_status/nodes' | jq -r '.nodes[].desc.address.addressField'
crlMBP-VHN267714PMTY2.local:26257
crlMBP-VHN267714PMTY2.local:26267
crlMBP-VHN267714PMTY2.local:26269
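
A rough extension of that jq filter, assuming the JSON field names sqlAddress and localityAddress (taken from the NodeDescriptor proto; they may differ by version), could look like this:

# Emit the RPC, SQL, and per-locality advertised addresses for every node,
# dropping nulls for fields a node doesn't set and de-duplicating.
curl -s 'http://localhost:26258/_status/nodes' \
  | jq -r '.nodes[].desc
           | .address.addressField,
             .sqlAddress.addressField,
             (.localityAddress // [])[].address.addressField' \
  | grep -v '^null$' | sort -u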