cockroachdb / cockroach


copy: figure out how to reduce kv costs further #90743

Open cucaroach opened 1 year ago

cucaroach commented 1 year ago

Copy is currently mostly bound by the CPU time required to execute kv batch requests. We want to figure out how to reduce these costs so that ideas about optimizing the SQL layer (removing datums, etc.) start to make more sense. Ideas to explore:

See image for current COPY profile:

[Two screenshots of the COPY CPU profile, taken 2022-10-26]

Jira issue: CRDB-20919

Epic CRDB-18892

DrewKimball commented 1 year ago

We're spending a lot of time in mvcc.GetMetadata, any chance we could use MVCCBlindInitPut instead of MVCCInitPut? It seems like it might be possible for copy...

Maybe we could make the kv/storage stuff faster by improving locality? If my reading is correct, we only buffer 100 rows in the copy logic before flushing - is that true? We could buffer more (say up to the working memory limit or even more) and then sort before flushing batches from the buffer. I don't know how complicated it would be, but it might also improve things if we updated the indexes one at a time.

WRT the sorting, how sorted are things to start with? Maybe we could avoid work by first checking for the already-sorted runs, and only sorting chunks of runs that are too small. Then we could merge the runs that remain. Or if it's common for everything to already be sorted, we could check for that and skip the sort.

cucaroach commented 1 year ago

Thanks @DrewKimball I like the cut of your jib!

We're doing blind puts for PK and init puts for the secondary indexes, I'm not exactly sure why but I'll look into it.

Also interesting: there's zero sorting overhead if I take out the indexes. The input is PK sorted, so I think we're just using an algorithm that likes pre-sorted data. The secondary indexes suffer from having real sorting work to do, but all that sorting happens in the kvclient layer, so it's serialized onto a single CPU for each index.

Maybe if I did columnar batches I could sort them in separate goroutines and eliminate that overhead. Or, maybe even better, the kvclient code could break out a parallel sort algorithm for large batches so we didn't have to worry about it. That said, parallelization at the SQL layer is attractive because it gives me a way to do the kv encoding in parallel as well.

I do think 100 rows at a time is small and part of this work will be to figure out why throughput suffers with more than that.

DrewKimball commented 1 year ago

Is there any possibility to use AddSSTable?

cucaroach commented 1 year ago

Not directly; indirectly, through #85573, it's being considered.

cucaroach commented 1 year ago

Closing this as I learned that if the indexes are added in a separate transaction the index backfill does this for us. See #91087 for possible enhancement.

cucaroach commented 1 year ago

Reopening; would like to get some storage folks' thoughts on the SeekPrefixGE stuff and whether there are any easy wins there...

jbowens commented 1 year ago

I don't think there are any easy wins there: SeekPrefixGE is the highly optimized get read path.

cucaroach commented 1 year ago

Is there anything that can be done if you have not one put but a bag of (possibly sorted) puts coming in at once?

DrewKimball commented 1 year ago

> We're doing blind puts for PK and init puts for the secondary indexes, I'm not exactly sure why but I'll look into it.

Did this ever turn anything up? I still don't understand why we can do blind puts for PK but not secondary.

cucaroach commented 1 year ago

And this is a COPY into an empty table inserting a couple hundred megabytes of data into a handful of ranges all in the same transaction. Is there no way to get more of what we need to do in memory? I have the CacheSize set to 2GB but I haven't done experiments to see if making it bigger/smaller helps/hurts.

cucaroach commented 1 year ago

> We're doing blind puts for PK and init puts for the secondary indexes, I'm not exactly sure why but I'll look into it.

Did this ever turn anything up? I still don't understand why we can do blind puts for PK but not secondary.

I think I got this backwards: we do CPut for PK to detect unique violations and InitPuts for the secondary indexes. I'm actually not sure how we detect uniqueness violations for secondary indexes. I think we only use blind puts in the rare circumstances described here.
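The distinction between the two write variants can be shown with a toy in-memory model. The real MVCC implementations live in pkg/storage and take many more arguments; this sketch only captures the conflict semantics discussed above:

```go
package main

import (
	"errors"
	"fmt"
)

// store is a stand-in for the engine; keys and values are plain strings.
type store map[string]string

var errConflict = errors.New("condition failed")

// cput writes key=val only if the existing value matches expected
// ("" here meaning the key must be absent). This is how a conditional
// put surfaces PK unique violations.
func (s store) cput(key, val, expected string) error {
	if s[key] != expected {
		return errConflict
	}
	s[key] = val
	return nil
}

// initPut writes key=val, tolerating an existing identical value
// (idempotent on retry) but failing on a different one. This models
// the secondary-index write.
func (s store) initPut(key, val string) error {
	if old, ok := s[key]; ok && old != val {
		return errConflict
	}
	s[key] = val
	return nil
}

func main() {
	s := store{}
	fmt.Println(s.cput("k", "v1", "")) // <nil>: key was absent
	fmt.Println(s.cput("k", "v2", "")) // condition failed: key exists
	fmt.Println(s.initPut("k", "v1"))  // <nil>: same value is tolerated
	fmt.Println(s.initPut("k", "v2"))  // condition failed: different value
}
```

A blind put, by contrast, would skip the read entirely and overwrite unconditionally, which is why it can't detect either kind of violation.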

jbowens commented 1 year ago

> Is there anything that can be done if you have not one put but a bag of (possibly sorted) puts coming in at once?

If they are sorted and there are no overlapping keys in the Engine, a fast path is possible that uses a non-prefix seek. This is what storage.CheckSSTConflicts does. This wouldn't be an easy win if we're trying to replace the read during a Put.
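The single-forward-pass idea can be sketched with slices standing in for the engine iterator and the incoming batch. This is loosely modeled on what storage.CheckSSTConflicts does, heavily simplified; the function name and string keys are invented for the example:

```go
package main

import "fmt"

// conflictCheckSorted scans a sorted batch of incoming keys against the
// sorted keys already in the engine using one merging forward pass,
// rather than one prefix seek per key. The engine "iterator" position i
// only ever advances, which is what makes the non-prefix-seek fast path
// cheap.
func conflictCheckSorted(engine, batch []string) []string {
	var conflicts []string
	i := 0
	for _, k := range batch {
		// Advance monotonically; never seek backwards.
		for i < len(engine) && engine[i] < k {
			i++
		}
		if i < len(engine) && engine[i] == k {
			conflicts = append(conflicts, k)
		}
	}
	return conflicts
}

func main() {
	engine := []string{"a", "c", "e"}
	batch := []string{"b", "c", "d"}
	fmt.Println(conflictCheckSorted(engine, batch)) // [c]
}
```

The whole check is O(len(engine) + len(batch)) regardless of batch size, versus one seek per key in the per-Put read path, which is the shape of the win being described.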

nicktrav commented 1 year ago

We talked about this one again today. We're not aware of any optimizations here that are straightforward to pull off.

@cucaroach - are we ok moving this back to closed? Or perhaps splitting off a more specific request?

cucaroach commented 1 year ago

I'll take it off the storage board for now until we have something more concrete to ask for. Thanks for spending some time thinking about it!