ajwerner opened this issue 2 years ago
@vy-ton and @rytaft can you all discuss this work and prioritize it accordingly?
I think it can be powerful for our serverless users (to help control costs/RUs) and will benefit everyone with faster scan speed (it covers for poorly written or less-than-ideally optimized queries).
Queries has https://cockroachlabs.atlassian.net/browse/CRDB-14837 to track this. During 22.2 planning, we kept this off the roadmap but left it next in priority as resources free up. FYI @mgartner
We should make an Apache Arrow batch response format which does column projection based on the `IndexFetchSpec`.
In case you weren't aware, we do actually use Apache Arrow already in the vectorized engine, to send the columnar data around via DistSQL. The unfortunate bit is that we do not currently use the Arrow-serialized data as-is in the vectorized engine for most types: it requires deserialization, which sort of defeats the point. See `pkg/col/colserde`.
> which sort of defeats the point
This seems somewhat dramatic. At least some of the point here is to project out columns to make the returned batch smaller. I hope that our Arrow-to-column-vector conversion is relatively high throughput.
I think it's high throughput enough, and there are some benchmarks. I just meant I've always kind of regretted not finding a way to unify the in memory representation and the arrow one for that zero copy goodness!
How sad would people feel if we don't implement the "local fast-path" with the KV projection pushdown? I mean that even if the KV Scan request is evaluated locally and we create a `coldata.Batch` with the necessary columns, we would still serialize it into `[]byte` (currently in the Arrow format) and then deserialize it in `ColBatchDirectScan`.

The main argument for keeping the local fast path is that we can eliminate this serialization / deserialization step, and it seems nice to do so if we can.
However, there are several reasons for not doing it:
- We'd need to allocate a new `coldata.Batch` for each set of KVs (because we just shove the batch directly into the scan response). On the other hand, if we serialize the batch, we can then reuse the same batch for the next set of KVs. We could imagine that eventually this problem could be solved if we pool `coldata.Batch` objects though.
- ~Combining `BatchResponse`s from different nodes. If we have a scan request that spans multiple ranges and some of them are local while others are remote, it'll be difficult to combine the responses with different formats while maintaining the ordering.~ We'll just deserialize all of the responses and combine `coldata.Batch`es while maintaining the order.

Update: we probably will implement this local fast-path after all, in order to eliminate (or at least substantially reduce) the perf hit when comparing against single-tenant deployments.
23.1 must-haves:
- `Get` requests
- `TraceKV` flag of `cFetcher`
- `KV Bytes Read` statistic?

23.1 nice-to-haves:
- `estimatedRowCount` as the hint for `cFetcherWrapper` (#94850)

Later:
**Is your feature request related to a problem? Please describe.**

One known bottleneck for cockroach performance is so-called "scan speed". In practice, this is the speed to scan data off of disk, encode it into the scan response, decode it, and then re-encode it into a columnar format. The columnar format is now used extensively in execution.

The above summary is misleading for a dedicated cluster: often the query execution happens in the same process as the kvserver, so the encoding and decoding steps can be skipped. In multi-tenant deployments, however, the data must be transmitted over the network back to the server. This can be particularly costly when the data is being served from a separate availability zone ([1], https://github.com/cockroachdb/cockroach/issues/71887).

The above proposal has the potential to improve scan speed by 1) not decoding columns we don't need and 2) creating much smaller responses.

Any eventual movement towards columnarization at the storage layer will need to have a corresponding read API. This issue posits that we should build the columnar read API first to gain experience.
**Describe the solution you'd like**

We should make an Apache Arrow batch response format which does column projection based on the `IndexFetchSpec`.

**Additional context**
Relates very closely to (if it doesn't just add exposition to) https://github.com/cockroachdb/cockroach/issues/71887.
@jordanlewis made a prototype here: https://github.com/cockroachdb/cockroach/pull/52863. At the time it showed a ~5% win in TPCH performance.
@RaduBerinde put in a ton of work to clean up how we specify the data to be fetched. Now there exists a small protobuf which could conceivably be transmitted with the scan request and used to describe how to decode the data.
[1] We're probably going to do https://github.com/cockroachdb/cockroach/issues/72593 to attack the cross-AZ network cost problem.
Jira issue: CRDB-16284 Epic: CRDB-14837