awoods187 closed this issue 1 year ago.
We could also have a forcing mechanism of some sort to make sure everything not sampled is sampled after some period of time.
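As a rough sketch (not anything that exists in the code today), such a forcing mechanism could remember when each fingerprint was last sampled and force a sample once it has gone unsampled for too long. The names below (`lastSampled`, `shouldForceSample`, `maxAge`) are hypothetical and the map is not concurrency-safe; this is illustration only.

```go
package sampling

import (
	"math/rand"
	"time"
)

// lastSampled remembers when each statement fingerprint was last sampled.
// Hypothetical structure, not concurrency-safe; for illustration only.
var lastSampled = map[string]time.Time{}

// shouldForceSample combines the probabilistic decision with a forcing rule:
// any fingerprint that has not been sampled within maxAge gets sampled now.
func shouldForceSample(fingerprint string, sampleRate float64, maxAge time.Duration) bool {
	if t, ok := lastSampled[fingerprint]; !ok || time.Since(t) > maxAge {
		lastSampled[fingerprint] = time.Now()
		return true
	}
	if rand.Float64() < sampleRate {
		lastSampled[fingerprint] = time.Now()
		return true
	}
	return false
}
```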
Might take a look at this
I don't think there's anything we can do here for 21.1, and I think a percentage of queries is fine (especially to start with). Looks like this Dapper paper by Google (https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36356.pdf) talks about a similar strategy (although we sample 100x more than 1/1024 given the default sampling rate is 10/100).
I tried to implement an improvement (always sample an execution the first time we see a txn fingerprint), but the difficulty with making sampling decisions based on a finalized transaction fingerprint, or execution time, is that you have to wait until the end of the query. Since we use tracing for stats, we need to know whether or not we want to sample a transaction before we execute it.
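To illustrate the constraint: the sampling decision has to be made up front, with only the information available before execution (e.g. the statement's fingerprint), never its eventual latency. The sketch below is not the actual implementation; `seenFingerprints` and `shouldSample` are hypothetical names, and the map is not concurrency-safe.

```go
package sampling

import "math/rand"

// seenFingerprints tracks which statement fingerprints have already been
// sampled at least once. Hypothetical structure for illustration only.
var seenFingerprints = map[string]bool{}

// shouldSample must be answered *before* the statement runs, because tracing
// has to be enabled up front to collect execution stats. At this point we
// only know the fingerprint, not how long the statement will take.
func shouldSample(fingerprint string, sampleRate float64) bool {
	if !seenFingerprints[fingerprint] {
		// Always sample the first execution of a new fingerprint.
		seenFingerprints[fingerprint] = true
		return true
	}
	// Otherwise fall back to the probabilistic decision.
	return rand.Float64() < sampleRate
}
```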
There's been some talk about using the optimizer cost. I think this is a good idea, but it will require some plumbing and maintaining some sort of histogram to provide context on whether a given cost is high or not.
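A minimal sketch of what that could look like, assuming we keep a sliding window of recently observed plan costs and treat anything above a chosen percentile as "high cost". All names here are hypothetical, and a real implementation would want a bounded, concurrency-safe quantile sketch rather than sorting a slice on every check.

```go
package sampling

import "sort"

// costHistory holds recently observed optimizer costs. Hypothetical
// structure, illustration only.
type costHistory struct {
	costs []float64
}

// record adds an observed plan cost, keeping a sliding window of recent costs.
func (h *costHistory) record(cost float64) {
	h.costs = append(h.costs, cost)
	if len(h.costs) > 1000 {
		h.costs = h.costs[1:]
	}
}

// isHighCost reports whether cost is above the given percentile (0-1) of
// recently observed costs, i.e. expensive relative to this workload.
func (h *costHistory) isHighCost(cost, percentile float64) bool {
	if len(h.costs) == 0 {
		return true // no context yet; treat everything as worth sampling
	}
	sorted := append([]float64(nil), h.costs...)
	sort.Float64s(sorted)
	threshold := sorted[int(percentile*float64(len(sorted)-1))]
	return cost >= threshold
}
```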
In conclusion, I think a 10% sample rate is fine for 21.1, especially since the longer a workload runs, the more complete a picture of execution stats you build up. The performance hit is relatively minimal, and the operator can always increase the rate if the stats are not up to par. Our current approach to getting more granular information on which queries are slow is to set a statement tracing threshold, which turns on tracing for all transactions but only logs the trace when the transaction latency exceeds the threshold. This is obviously pretty bad for performance, but since we recommend it in production right now, I don't think operators will mind increasing the sampling rate for more observability.
Another interesting approach is adaptive sampling: changing the sample rate based on workload characteristics.
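For example (purely illustrative, with hypothetical names and tuning), an adaptive controller could nudge the sample rate so that the number of sampled statements per second tracks a target budget:

```go
package sampling

// adjustSampleRate nudges the sample rate so that the observed number of
// sampled statements per second tracks a target budget. The names and the
// clamping bounds are hypothetical; a real controller would need smoothing
// and limits tuned against production workloads.
func adjustSampleRate(current, sampledPerSec, targetPerSec float64) float64 {
	if sampledPerSec == 0 || targetPerSec == 0 {
		return current
	}
	next := current * (targetPerSec / sampledPerSec)
	// Clamp so we never disable sampling entirely or sample everything.
	if next < 0.001 {
		next = 0.001
	}
	if next > 1.0 {
		next = 1.0
	}
	return next
}
```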
For any future work on this, however, I think it'd be good to get some actual data on whether our approach is insufficient.
@awoods187 @kevin-v-ngo I'd like to close this now that we have the "first-statement" sampling method, but I'm curious to hear further thoughts...
I don't think our work here is done. It's great that we can collect a fingerprint the first time it's run, but that isn't sufficient to make it predictable which statements will get sampled, and it can mean that we don't sample more recent statements that fall under a particularly problematic fingerprint. I think we need to go back and look at a more deterministic way to understand what's being sampled, similar to other databases sampling every n seconds.
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
Currently, for the DB Console's new information, we sample statements as follows: for every statement we draw a random number in the [0.0, 1.0] range, and if it is lower than the sample_rate setting, we collect stats on that statement. For example, if the sample rate is 0.5, there is a one-time 50% chance that any particular statement gets sampled, regardless of how long-running it is.
This can cause long-running statements to be silent in our sampling (and therefore in the UI), and it also means that not every statement is sampled.
Other databases sample every n seconds (usually 1) to make sure they catch all statements currently running.
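Sketched in Go for comparison (with hypothetical hooks, not how those databases are actually implemented): a time-based sampler polls the set of currently executing statements every interval, so anything running longer than the interval is guaranteed to be observed at least once.

```go
package sampling

import (
	"context"
	"time"
)

// pollRunningStatements illustrates the time-based approach: every interval,
// snapshot whatever statements are currently executing. Any statement running
// longer than the interval is observed at least once.
// listRunningStatements and recordSample are hypothetical hooks.
func pollRunningStatements(
	ctx context.Context,
	interval time.Duration,
	listRunningStatements func() []string,
	recordSample func(stmt string),
) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, stmt := range listRunningStatements() {
				recordSample(stmt)
			}
		}
	}
}
```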
Concerns raised by @robert-s-lee
Jira issue: CRDB-2975