ajwerner opened this issue 2 years ago
@vy-ton and @rytaft can you all discuss this work and prioritize it accordingly?
I think it can be powerful for our serverless users (to help control costs/RUs) and will benefit everyone with faster scan speed (it covers for poorly written or less-than-ideally optimized queries).
Queries has https://cockroachlabs.atlassian.net/browse/CRDB-14837 to track this. During 22.2 planning, we kept this off the roadmap but left it next in priority as resources free up. FYI @mgartner
We should make an Apache Arrow batch response format which does column projection based on the `IndexFetchSpec`.
In case you weren't aware, we do actually use Apache Arrow already in the vectorized engine, to send the columnar data around via DistSQL. The unfortunate bit is that we do not currently use the Arrow-serialized data as-is in the vectorized engine for most types: it requires deserialization, which sort of defeats the point. See `pkg/col/colserde`.
> which sort of defeats the point
This seems somewhat dramatic. At least some of the point here is to project out columns to make the returned batch smaller. I hope that our Arrow-to-column-vector conversion is relatively high throughput.
I think it's high throughput enough, and there are some benchmarks. I just meant I've always kind of regretted not finding a way to unify the in memory representation and the arrow one for that zero copy goodness!
How sad would people feel if we don't implement the "local fast-path" with the KV projection pushdown? I mean that even if the KV Scan request is evaluated locally and we create a `coldata.Batch` with the necessary columns, we would still serialize it into `[]byte` (currently in the Arrow format) and then deserialize it in `ColBatchDirectScan`.

The main argument for keeping the local fast path is that we can eliminate this serialization / deserialization step, and it seems nice to do so if we can.
However, there are several reasons for not doing it:
- We'd need to allocate a new `coldata.Batch` for each set of KVs (because we just shove the batch directly into the scan response). On the other hand, if we serialize the batch, we can then reuse the same batch for the next set of KVs. We could imagine that eventually this problem could be solved if we pool `coldata.Batch` objects though.
- ~Combining `BatchResponse`s from different nodes. If we have a scan request that spans multiple ranges and some of them are local while others are remote, it'll be difficult to combine the responses with different formats while maintaining the ordering.~ We'll just deserialize all of the responses and combine `coldata.Batch`es while maintaining the order.

Update: we probably will implement this local fast-path after all, in order to eliminate (or at least substantially reduce) the perf hit when comparing against single-tenant deployments.
23.1 must-haves:
- `Get` requests
- `TraceKV` flag of `cFetcher`
- `KV Bytes Read` statistic?

23.1 nice-to-haves:
- `estimatedRowCount` as the hint for `cFetcherWrapper` (#94850)

Later:
**Is your feature request related to a problem? Please describe.**

One known bottleneck for cockroach performance is so-called "scan speed". In practice, this is the speed to scan data off of disk, encode it into the scan response, decode it, and then re-encode it into a columnar format. The columnar format is now used extensively in execution.

The above summary is misleading for a dedicated cluster: often the query execution happens in the same process as the kvserver, so the encoding and decoding steps can be skipped. In multi-tenant deployments, however, the data must be transmitted over the network back to the server. This can be particularly costly when the data is being served from a separate availability zone ([1], https://github.com/cockroachdb/cockroach/issues/71887).

The above proposal has the potential to improve scan speed by 1) not decoding columns we don't need and 2) creating much smaller responses.

Any eventual movement towards columnarization at the storage layer will need to have a corresponding read API. This issue posits that we should build the columnar read API first to gain experience.
**Describe the solution you'd like**

We should make an Apache Arrow batch response format which does column projection based on the `IndexFetchSpec`.

**Additional context**
Relates very closely to (if it doesn't just add exposition to) https://github.com/cockroachdb/cockroach/issues/71887.
@jordanlewis made a prototype here: https://github.com/cockroachdb/cockroach/pull/52863. At the time it showed a ~5% win in TPCH performance.
@RaduBerinde put in a ton of work to clean up how we specify the data to be fetched. Now there exists a small protobuf which could conceivably be transmitted with the scan request and used to describe how to decode the data.
[1] We're probably going to do https://github.com/cockroachdb/cockroach/issues/72593 to attack the cross-AZ network cost problem.
Jira issue: CRDB-16284 Epic: CRDB-14837