joshimhoff opened this issue 1 year ago
CC @tbg
hardware overload
What do you mean by that? You will measure hardware overload no matter what, no? Similarly, network latencies are measured too, and if there is network congestion that too will influence the measurement. I'd think all of this is intentional, too.
Ideally, AC should shift queueing into the AC system rather than the goroutine scheduler, etc. So, for the same reason we expect node liveness HBs & kvprober requests to succeed under hardware overload, we expect `kv.replica_read_batch_evaluate.single_row.latency` to not increase much under hardware overload. FWIW, I expect it will increase a bit, and that is fine; also, at least in 22.2 there are some known gaps in AC, such as no AC on followers, that can lead to significant queueing outside of AC.
Perhaps the problem is the term "hardware overload". In this case, by "hardware overload", I mean some hardware resource is saturated but only a little since AC is working.
Re: network, yes, since AC doesn't take this into account. We have not observed that problem in practice in CC to date.
I would replace the word "hardware overload" with "hardware queueing". Specifically, there are a few places I could see hardware queueing:
1) Network - if the aggregate throughput is greater than the link bandwidth, then gRPC calls will stall before sending (since TCP typically has a small window)
2) CPU - if the number of runnable goroutines is > 1, then the Go scheduler will prioritize some operations over others
3) Disk - if the queue depth is > the disk queue depth, then it will wait prior to issuing a read/write to the disk (*)
4) Memory - if the memory reads/writes exceed the memory bus speed, then the CPU will stall
(*) For Disk it is a little more tricky, since the OS will also queue and reorder operations because we are going through the file system and not directly to the raw disk. The file system exposes metrics (dirty bytes, ...) that can be used to determine how far behind it is on writes; a rough sketch of reading them follows below.
Ideally, AC would reduce the size of these queues to prevent hardware queueing. I'm not sure it will want to consider all of them, though. For instance, memory queueing requires a deep understanding of caches and NUMA architecture, so that is likely outside the scope of what AC should do.
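To make the dirty-bytes signal above concrete, here is a minimal, Linux-only sketch that reads the kernel's `Dirty` and `Writeback` counters from `/proc/meminfo`. It is purely illustrative; this is not something AC currently does, and any thresholds you'd act on would depend on the disk.

```go
// Hypothetical sketch: read the kernel's dirty/writeback page counters from
// /proc/meminfo as a rough signal of how far behind the filesystem is on writes.
// Linux-only; this is illustrative, not what AC actually does.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// dirtyWritebackKB returns the "Dirty" and "Writeback" values (in KiB) from /proc/meminfo.
func dirtyWritebackKB() (dirty, writeback int64, err error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text()) // e.g. ["Dirty:", "1234", "kB"]
		if len(fields) < 2 {
			continue
		}
		v, perr := strconv.ParseInt(fields[1], 10, 64)
		if perr != nil {
			continue
		}
		switch fields[0] {
		case "Dirty:":
			dirty = v
		case "Writeback:":
			writeback = v
		}
	}
	return dirty, writeback, scanner.Err()
}

func main() {
	d, w, err := dirtyWritebackKB()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("dirty=%d KiB writeback=%d KiB\n", d, w)
}
```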
"single row" seems overly specific and also maybe not what it sounds like you're looking for here -- a "row" could be 200 reads or one read depending on how many column families and plenty of workloads are going to be scans rather than single row reads.
It sounds like what you're getting at is measuring the latency / service time of the storage layer in finding and returning the data requested by the transaction layer, as an indicator of storage performance that is independent of the transaction layer sitting on top of it, including anything like AC queueing, latch acquisitions, etc. that happen in that transaction layer? You just want a pure indicator of how quickly the storage layer is getting the data to it?
For that, I think it sounds like the iterator stats, specifically the timing of Seek() and Next() calls, might get you closer to what you want, i.e. the retrieval performance of the storage engine?
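To make that concrete, here is a toy sketch of wrapping an iterator and timing its Seek/Next calls. The interface and histogram types are made-up stand-ins, not pebble's actual API or stats plumbing; it just shows the shape of the signal (service time inside the storage layer, with queueing above it excluded).

```go
// Toy sketch of timing iterator operations. The iterator and recorder
// interfaces below are hypothetical stand-ins; pebble's real iterator stats
// plumbing is different.
package storage

import "time"

// iterator is a minimal stand-in for the storage engine iterator surface we care about.
type iterator interface {
	SeekGE(key []byte) bool
	Next() bool
}

// latencyRecorder is a stand-in for whatever histogram/metric sink is used.
type latencyRecorder interface {
	Record(d time.Duration)
}

// timedIterator measures time spent inside Seek/Next calls only, so queueing
// above the storage layer (AC, latches, etc.) is excluded from the measurement.
type timedIterator struct {
	it       iterator
	seekHist latencyRecorder
	nextHist latencyRecorder
}

func (t *timedIterator) SeekGE(key []byte) bool {
	start := time.Now()
	ok := t.it.SeekGE(key)
	t.seekHist.Record(time.Since(start))
	return ok
}

func (t *timedIterator) Next() bool {
	start := time.Now()
	ok := t.it.Next()
	t.nextHist.Record(time.Since(start))
	return ok
}
```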
I'd think all of this is intentional, too.
Just noticed this bit. Hardware overload is a problem for the customer experience, so it makes sense to measure it with certain metrics. But we don't want to wake up an SRE at 3am because a cluster is hardware overloaded. As a result, for this metric, I like that it mostly doesn't capture hardware overload.
I would replace the word "hardware overload" with "hardware queueing".
Good idea. Updating ticket title.
Ideally, AC would reduce the size of these queues to prevent hardware queuing. I'm not sure it will want to consider all of them though.
Agreed! I think it's okay that AC is not perfect. AC needs to be good enough that SRE / dev can page on things like kvprober / this & not receive too many false positive pages (pages with cause being hardware overload). Some false positive pages are fine tho.
For that, I think it sounds like the iterator stats, specifically the timing of Seek() and Next() calls, might get you closer to what you want, i.e. the retrieval performance of the storage engine?
I'll take a look at that. Thanks for the idea!
Re: goal, the goal you are laying out is perhaps not the same as the goal I have. I think the goal is to measure E2E SQL perf, in a latency sense, for the single SQL row SELECT workload, but with certain pieces of latency that are generally caused by customer workload choices, e.g. contention + hardware queueing, excluded. You might say: Josh, you are not measuring any SQL stuff. Agreed, but that is more for expediency! One way to think about this ticket is as a very scrappy & minimal version of https://github.com/cockroachdb/cockroach/issues/71169#issuecomment-1049281030.
"single row" seems overly specific and also maybe not what it sounds like you're looking for here -- a "row" could be 200 reads or one read depending on how many column families and plenty of workloads are going to be scans rather than single row reads.
Re: 200 reads or one read depending on column families, let's see if that is a problem in practice.
Re: scans, I think that is out of scope. Useful to know perf, in a latency sense, of non-scan workloads, even tho people do scans. Scans are tricky since perf is expected to be extremely variable.
I guess I'm arguing with the goal; why is your goal "single row SELECT latency that excludes `$stuff`"?
Is that a goal because we think that if we get it we will be able to answer questions like "how does disagg storage affect performance?" or "does this node on this cluster have a dodgy EBS volume and need to trigger an alert?", without the confounding variables of per-workload differences like contention, AC, schema, etc.? If so, then I think a metric for "single row SQL read excluding $stuff" isn't actually the right goal, i.e. not the thing that, if we had it, would best give us those answers, for reasons like the ones I gave above. In particular, I think exposing the pebble iterator operation latencies, measured below all of the "
Ya, it's an interesting point. I shall dig more into details of exactly what the iterator stats are measuring & then report back!
**Is your feature request related to a problem? Please describe.**
Would be nice to measure point read latency of single row SQL reads done by users, without capturing contention or hardware overload. CRDB should deliver predictable performance for single row SQL reads.
**Describe the solution you'd like**
Implement `kv.replica_read_batch_evaluate.single_row.latency`. Similar to `kv.replica_read_batch_evaluate.latency`, but only captures read requests that touch a single SQL row (or similar, e.g. allow joins by FK). @rytaft suggests a way to check if a read touches just one SQL row at https://github.com/cockroachdb/cockroach/issues/71169#issuecomment-1049857896. We could do the check in SQL land, attach some metadata to the context indicating that only a single SQL row is affected, plumb that into `BatchRequest` like we do with tracing info, then plumb it into the `kvserver` context and use it to implement `kv.replica_read_batch_evaluate.single_row.latency`. A rough sketch of the plumbing is below.
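Very rough sketch of what that plumbing could look like; all names here are hypothetical (the real `BatchRequest` proto, metric registration, and evaluate path differ), it's just to illustrate the shape of the change:

```go
// Hypothetical sketch of the plumbing; all names are made up and do not match
// the real BatchRequest proto, metric registration, or evaluate path.
package kvserver

import "time"

// batchHeader stands in for the BatchRequest header; SQL would set SingleRowRead
// when it knows the read touches exactly one SQL row.
type batchHeader struct {
	SingleRowRead bool
}

// histogram is a stand-in for the metric sink backing
// kv.replica_read_batch_evaluate.single_row.latency.
type histogram interface {
	RecordValue(nanos int64)
}

type replicaMetrics struct {
	readBatchEvaluateLatency          histogram
	readBatchEvaluateSingleRowLatency histogram
}

// recordReadEvaluate is called around read batch evaluation; it records into the
// existing latency histogram and, if the header says single-row, into the new one.
func (m *replicaMetrics) recordReadEvaluate(h batchHeader, dur time.Duration) {
	m.readBatchEvaluateLatency.RecordValue(dur.Nanoseconds())
	if h.SingleRowRead {
		m.readBatchEvaluateSingleRowLatency.RecordValue(dur.Nanoseconds())
	}
}
```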
**Describe alternatives you've considered**
https://github.com/cockroachdb/cockroach/issues/71169 is a more complete solution but will take much longer to implement.
**Additional context**
This will help with:
Jira issue: CRDB-25182