cockroachdb / cockroach

storage: stream SSTs instead of KV_BATCH when sending snapshots #39716

Open jeffrey-xiao opened 5 years ago

jeffrey-xiao commented 5 years ago

#38932 introduces the logic for incrementally building SSTs on the receiver side. That approach does not get the compression benefit of streaming SSTs when sending snapshots. I did an informal experiment to see the savings we'd get from streaming SSTs instead of KV_BATCH, and the results are promising.

For an adversarial workload like kv0, where all the keys and values are random data and compression yields no benefit, the overhead of streaming SSTs instead of KV_BATCHES is fairly small. For kv0 I saw a 4% increase in the amount of data we were streaming for various range sizes.

For a more realistic workload like tpcc, I saw as much as an 80% reduction in the amount of data we were sending.
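
To make the intuition behind those two numbers concrete, here is a minimal, self-contained sketch (not the experiment described above). It compares how well random bytes and repetitive, row-like data compress. Everything in it is an assumption for illustration: DEFLATE from the Go standard library stands in for whatever block compression the SST writer uses, and the key/value layout in `repetitivePayload` is a made-up format, not CockroachDB's actual encoding.

```go
package main

import (
	"bytes"
	"compress/flate"
	"crypto/rand"
	"fmt"
)

// compressedSize returns the number of bytes produced by DEFLATE-compressing
// data. DEFLATE is only a stand-in for SST block compression here.
func compressedSize(data []byte) (int, error) {
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, flate.DefaultCompression)
	if err != nil {
		return 0, err
	}
	if _, err := w.Write(data); err != nil {
		return 0, err
	}
	// Close flushes any pending compressed output into buf.
	if err := w.Close(); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

// randomPayload mimics a kv0-style workload: n bytes of random data.
func randomPayload(n int) ([]byte, error) {
	b := make([]byte, n)
	_, err := rand.Read(b)
	return b, err
}

// repetitivePayload mimics rows that share key prefixes and repeat column
// values. The format is hypothetical and chosen only to be compressible.
func repetitivePayload(rows int) []byte {
	var buf bytes.Buffer
	for i := 0; i < rows; i++ {
		fmt.Fprintf(&buf, "/Table/53/1/%08d/0 -> warehouse=%04d status=OK\n", i, i%10)
	}
	return buf.Bytes()
}

func main() {
	structured := repetitivePayload(10000)
	random, err := randomPayload(len(structured))
	if err != nil {
		panic(err)
	}
	cases := []struct {
		name string
		data []byte
	}{
		{"random", random},
		{"structured", structured},
	}
	for _, c := range cases {
		n, err := compressedSize(c.data)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-10s raw=%d compressed=%d ratio=%.2f\n",
			c.name, len(c.data), n, float64(n)/float64(len(c.data)))
	}
}
```

Random input should come out at roughly its original size or slightly larger (the compressor only adds framing), while the repetitive input should shrink substantially. That matches the shape of the kv0 and tpcc results above, though the exact ratios depend on the compressor and the real data.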

It might be worthwhile to send the user range as an SST and the replicated range-id local keys and range-local keys as KV_BATCHES.

Unlike #38932, this change would have to be gated behind a version flag since the logic on the sender side would have to change.

Jira issue: CRDB-5567

Epic CRDB-39898

ajwerner commented 4 years ago

One question I have here is about the effectiveness of transport-level compression. By default, outside of roachprod, we use compression on our gRPC connections. My hunch is this isn't worth the complexity.

github-actions[bot] commented 3 years ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

erikgrinaker commented 2 years ago

Reopening this, since it'd be nice to standardize on SSTs for bulk data transport -- we have pretty mature infrastructure to handle SSTs, while the binary Pebble batch tooling in CRDB is mostly used for Raft snapshots and is much more primitive. If we could ingest the SSTs directly it'd likely net some nice performance gains too.

blathers-crl[bot] commented 2 years ago

cc @cockroachdb/replication

github-actions[bot] commented 10 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!