Open erikgrinaker opened 1 year ago
@erikgrinaker do you know which are the notable errors we should count? That would help us make quicker progress (prioritize) on this issue.
Afraid I don't have time to go over this now, maybe someone on KV can.
The Distributed → RPC Errors chart in the DB Console can be misleading. It only graphs two kinds of errors:
- distsender.rpc.sent.nextreplicaerror: the number of retryable replica errors that cause the DistSender to try a different replica.
- distsender.errors.notleaseholder: NotLeaseHolderError, i.e. the DistSender contacted a replica that wasn't the leaseholder and tried a different replica.

Both of these errors are benign, will be retried, and are expected during normal operation -- but they can still be of some interest during debugging, in some cases.
However, it strikes me as odd that we chart these benign, retryable errors, but we don't chart actual RPC errors that result in errors to the client at all. Furthermore, spikes in these charts (e.g. following a node restart) can cause undue alarm with customers -- see e.g. https://github.com/cockroachlabs/support/issues/2527 where a customer saw spikes during and after an upgrade, which they thought indicated problems with the upgrade, but were entirely normal and expected when doing a rolling restart.
We should do two things here:
1. This chart shouldn't be named RPC errors, since that isn't entirely accurate. We should downplay these kinds of internal retryable errors that are expected during normal operation and have negligible workload impact in the common case. We can still graph them, but make it clear that these are typically normal and expected, and don't result in client errors.
2. Chart actual RPC errors. We have a bunch of metrics for different error types under distsender.rpc.err.%s, but unfortunately don't have a counter across all error types -- we should consider graphing a few notable ones, and also the total count (which requires a new metric). Note that these are counted on the DistSender client node, not on the server node. We also have exec.error, which counts the number of KV batch request failures on a server node -- some of these errors are benign (e.g. ConditionFailedError), some aren't, but we don't differentiate between error types here.

It would be worthwhile for someone to take a holistic view of which error metrics we have, which metrics we want, and how to communicate them to users in a way that's meaningful and understandable.
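To make the missing total counter described above concrete, here is a minimal sketch of how a total could be maintained next to the per-type counters. It is not CockroachDB's actual metrics code: the `rpcErrorMetrics` type, its fields, and the `record` method are hypothetical names invented for illustration, and the error type strings are only example labels.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// rpcErrorMetrics is a hypothetical stand-in for per-error-type counters
// (in the spirit of distsender.rpc.err.%s) plus the proposed total counter.
// None of these names come from the CockroachDB code base.
type rpcErrorMetrics struct {
	// byType holds one counter per known error type. The map is populated
	// once at construction and never modified afterwards, so concurrent
	// reads are safe.
	byType map[string]*atomic.Int64
	// total is the new aggregate counter across all error types.
	total atomic.Int64
}

// newRPCErrorMetrics creates a counter for each of the given error types.
func newRPCErrorMetrics(errTypes ...string) *rpcErrorMetrics {
	m := &rpcErrorMetrics{byType: make(map[string]*atomic.Int64, len(errTypes))}
	for _, t := range errTypes {
		m.byType[t] = new(atomic.Int64)
	}
	return m
}

// record increments the per-type counter when the type is known, and always
// increments the total, so the total also covers untracked error types.
func (m *rpcErrorMetrics) record(errType string) {
	if c, ok := m.byType[errType]; ok {
		c.Add(1)
	}
	m.total.Add(1)
}

func main() {
	m := newRPCErrorMetrics("NotLeaseHolderError", "ConditionFailedError")
	m.record("NotLeaseHolderError")
	m.record("NotLeaseHolderError")
	m.record("ConditionFailedError")
	fmt.Println("NotLeaseHolderError:", m.byType["NotLeaseHolderError"].Load())
	fmt.Println("total:", m.total.Load())
}
```

Incrementing the total at the same point as the per-type counter keeps the two in step. Whether benign, retryable errors should be included in such a total is exactly the kind of question the holistic review would need to settle.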
Jira issue: CRDB-30531