kv,*:state inspection pages for a cluster node

sumeerbhola commented 3 years ago

(This is a tracking issue for discussion of specific ideas that can be spun off into separate issues)

We lack inspectz-style pages (Google terminology) on a node, which would show a view of the current state of certain data structures within that node. These would be used when metrics or traces have indicated that we need to look more closely at a particular node.

Possible examples: the state of (explicit or implicit) queues (e.g. the queues for latches and locks), including who is waiting and for how long; the current LSM state and ongoing compactions; etc. These pages don’t need to be fast to generate since they would be used sparingly (in the worst case, generating one could take a few seconds if the internal structure is large, and cause a few ms of delay in running queries). Such pages can use filters to keep the inspected data manageable, e.g. filtering by range, txnid, or key range.

This was less important when debug.zip was the primary way to troubleshoot, but we now have direct access in CC and for important customers for whom extremely short remediation time is critical.

Needless to say, deciding what state needs such a page is critical and needs to be informed by actual troubleshooting experience. The tooling around this should make it very easy to create one (i.e., any complexity should be limited to how to construct the view of the internal structure and not on how to pass in filtering parameters or display/format the output).
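A minimal sketch of what "easy to create" could look like, assuming a small registry that owns query-parameter parsing and output, so a page author only writes the view function. The names `Filters`, `ViewFunc`, `Registry`, and the `/inspectz/` path are hypothetical and are not existing CockroachDB APIs.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// Filters holds the first value of each query parameter, e.g. range or txnid.
type Filters map[string]string

// ViewFunc is the only piece a page author writes (hypothetical signature).
type ViewFunc func(f Filters) (string, error)

// Registry maps page names to views and serves them over HTTP.
type Registry struct {
	views map[string]ViewFunc
}

func NewRegistry() *Registry { return &Registry{views: map[string]ViewFunc{}} }

// Register adds a page under the given name.
func (r *Registry) Register(name string, v ViewFunc) { r.views[name] = v }

// Render flattens the query parameters into Filters and runs the named view.
func (r *Registry) Render(name string, q url.Values) (string, error) {
	v, ok := r.views[name]
	if !ok {
		return "", fmt.Errorf("no such page: %q", name)
	}
	f := Filters{}
	for k := range q {
		f[k] = q.Get(k)
	}
	return v(f)
}

// ServeHTTP lets the registry be mounted with http.Handle("/inspectz/", reg).
// For brevity, every error is reported as 404.
func (r *Registry) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	name := strings.TrimPrefix(req.URL.Path, "/inspectz/")
	out, err := r.Render(name, req.URL.Query())
	if err != nil {
		http.Error(w, err.Error(), http.StatusNotFound)
		return
	}
	fmt.Fprintln(w, out)
}

func main() {
	reg := NewRegistry()
	// The page author's entire contribution: construct the view.
	reg.Register("latches", func(f Filters) (string, error) {
		return "latch waiters for range " + f["range"], nil
	})
	out, err := reg.Render("latches", url.Values{"range": {"42"}})
	fmt.Println(out, err)
}
```

With this shape, filter passing and output formatting live in one place, and adding a page is a single `Register` call.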

@jbowens raised the following in the internal Slack thread: Is there some risk in having separate observability regimes for clusters that we have direct access to versus those we do not? Maybe we should think of a debug.zip as just a response format. When interacting directly with an inspectz-style UI, the UI requests a thin, filtered debug.zip of just the requested data. Otherwise, we can instruct customers to generate a debug.zip with the same information, which may be loaded into the same UI.

Jira issue: CRDB-8222

jbowens commented 3 years ago

In some cases there are difficulties when the internal state is too big (the filter situation I mentioned above).

To clarify, I'm suggesting that debug.zip generation supports the same type of filtering we'd do in the inspectz-style page. The turnaround is longer if we're asking a customer to run a command, so maybe narrow filtering isn't that useful without the ability to iterate and investigate in real time.
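To make the "debug.zip as a response format" idea concrete, here is a rough sketch, assuming the filtered views have already been rendered to text. `BuildZip` and `EntryNames` are hypothetical helpers, and the `<view>.txt` entry naming is an assumption, not the actual debug.zip layout.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"sort"
)

// BuildZip packs each rendered view into an archive entry named <view>.txt.
// Entries are written in sorted order so the output is deterministic.
func BuildZip(views map[string]string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	names := make([]string, 0, len(views))
	for n := range views {
		names = append(names, n)
	}
	sort.Strings(names)
	for _, n := range names {
		w, err := zw.Create(n + ".txt")
		if err != nil {
			return nil, err
		}
		if _, err := w.Write([]byte(views[n])); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// EntryNames lists the archive's entries, as a UI loading the file might.
func EntryNames(data []byte) ([]string, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	var names []string
	for _, f := range zr.File {
		names = append(names, f.Name)
	}
	return names, nil
}

func main() {
	// Illustrative view contents only.
	data, err := BuildZip(map[string]string{
		"latches": "latch waiters for range 42",
		"lsm":     "LSM summary",
	})
	if err != nil {
		panic(err)
	}
	names, err := EntryNames(data)
	fmt.Println(names, err)
}
```

The same bytes could be returned to a live UI or written to disk by a customer-run command, which is the symmetry the suggestion is after.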

sumeerbhola commented 2 years ago

The tooling around this should make it very easy to create one (i.e., any complexity should be limited to how to construct the view of the internal structure and not on how to pass in filtering parameters or display/format the output).

Note that this issue is about the tooling, and not about constructing the actual pages, which, given the tooling, usually becomes trivial. One important part of the tooling is the ability to create a multi-column table with string and numeric column types that can be re-sorted in the end-user's browser in ascending or descending order of any column. The end-user being able to filter the table using a regexp search on a string column would be an additional plus.

irfansharif commented 1 year ago

We're introducing lots of high-cardinality state as part of https://github.com/cockroachdb/cockroach/issues/95563. Each node, for example, is maintaining token buckets per "replication stream", defined by the <tenant id,store id> it's issuing replication traffic on behalf of + to. Each proposer replica is itself managing a range-oriented view of active replication streams. These are hard things to observe through aggregate metrics, but inspectz-style pages seem a lot more appropriate for zooming into in-memory state: which replication streams are blocked (due to unavailable flow tokens), which ranges are blocked, and due to which replicas specifically.
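A rough sketch of the kind of view this implies, assuming a point-in-time snapshot of available flow tokens keyed by replication stream. `Stream`, `TokenSnapshot`, and `BlockedStreams` are hypothetical names rather than the actual types from the linked issue, and treating "tokens <= 0" as blocked is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Stream identifies a replication stream by <tenant id, store id>.
type Stream struct {
	TenantID uint64
	StoreID  uint64
}

// TokenSnapshot is a hypothetical point-in-time copy of available flow
// tokens per stream.
type TokenSnapshot map[Stream]int64

// BlockedStreams returns the streams with no available tokens, ordered by
// tenant and then store so the page output is stable.
func BlockedStreams(s TokenSnapshot) []Stream {
	var out []Stream
	for st, tokens := range s {
		if tokens <= 0 { // assumption: non-positive tokens means blocked
			out = append(out, st)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].TenantID != out[j].TenantID {
			return out[i].TenantID < out[j].TenantID
		}
		return out[i].StoreID < out[j].StoreID
	})
	return out
}

func main() {
	// Illustrative values only.
	snap := TokenSnapshot{
		{TenantID: 1, StoreID: 1}: 4096,
		{TenantID: 1, StoreID: 2}: 0,
		{TenantID: 2, StoreID: 1}: -512,
	}
	for _, st := range BlockedStreams(snap) {
		fmt.Printf("blocked: tenant=%d store=%d\n", st.TenantID, st.StoreID)
	}
}
```

A per-range view of which replicas are holding up a stream would follow the same pattern, with the snapshot keyed by range instead.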

kv,*:state inspection pages for a cluster node #66772