cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.13k stars 3.81k forks source link

sql: evaluate other sampling rate options #61678

Closed awoods187 closed 1 year ago

awoods187 commented 3 years ago

Currently, for the DB Console's new information, we sample every statement we get a random number in [0.0, 1.0] range, if it is lower than the sample_rate setting, then we collect stats on this statement. For example, if the sample rate is 0.5, there is a one-time 50% chance any particular statement gets sampled, regardless of how long-running it is.

This can cause long-running statements to be silent in our sampling (and therefore UI) and will make sure that not every statement is sampled.

Other databases sample every n seconds (usually 1) to make sure they catch all statements currently running.

Concerns raised by @robert-s-lee

Jira issue: CRDB-2975

awoods187 commented 3 years ago

We could also have a forcing mechanism of some sort to make sure everything not sampled is sampled after some period of time.

asubiotto commented 3 years ago

Might take a look at this

asubiotto commented 3 years ago

I don't think there's anything we can do here for 21.1, and I think a percentage of queries is fine (especially to start with). Looks like this Dapper paper by Google (https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36356.pdf) talks about a similar strategy (although we sample 100x more than 1/1024 given the default sampling rate is 10/100).

I tried to implement an improvement (always sample an execution the first time we see a txn fingerprint), but the difficulty with making sampling decisions based on a finalized transaction fingerprint, or execution time, is that you have to wait until the end of the query. Since we use tracing for stats, we need to know whether or not we want to sample a transaction before we execute it.

There's been some talk about using the optimizer cost. I think this is a good idea, but will require some plumbing and maintaining some sort of histogram to build a context on whether a cost is high or not.

In conclusion I think a 10% sample rate is fine for 21.1, especially since the longer-running a workload is, the more you build up a picture of execution stats. The performance hit is relatively minimal, and the operator can always increase the rate if the stats are not up to par. Our current approach to get more granular information on which queries are slow is to enable a stmt tracing threshold which enables tracing on all transactions, but only logs the trace when the transaction latency is above the threshold. This is obviously pretty bad for performance, but since we recommend this in production right now, I don't think operators will mind increasing the sampling rate for more observability.

asubiotto commented 3 years ago

Another interesting approach for sampling is adaptive sampling. Basically, changing the sample rate based on some workload characteristics.

For any future work on this, however, I think it'd be good to get some actual data on whether our approach is insufficient.

yuzefovich commented 3 years ago

63012 will add the collection of stats on a per stmt basis the first time we see a particular stmt.

jordanlewis commented 3 years ago

@awoods187 @kevin-v-ngo I'd like to close this now that we have the "first-statement" sampling method, but curious for further thoughts...

awoods187 commented 3 years ago

I don't think our work here is done. It's great that we can collect a fingerprint the first time its run, but that isn't sufficient to make it predictable which statements will get sampled and it can mean that we don't sample more recent statements that fit into a particularly problematic fingerprint. I think we need to go back and look at a more deterministic way to understand what's being sampled similar to other databases sampling every n seconds.

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!