ajkr closed this issue 5 years ago.
Focusing on a scenario with a very wide perf gap (value size 1024, scan size 100), here are some performance measurements.
CPU profiles (functions consuming >1% of cycles, excluding child calls). No batching:
+ 24.72% pebble [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 7.62% pebble pebble [.] rocksdb::crc32c::crc32c_3way
+ 3.58% pebble [kernel.kallsyms] [k] find_get_entry
+ 3.07% pebble pebble [.] cockroach::DBComparator::Compare
+ 2.45% pebble pebble [.] runtime.casgstatus
+ 2.32% pebble pebble [.] rocksdb::MergingIterator::Next
+ 2.18% pebble libc-2.27.so [.] __memcmp_avx2_movbe
+ 2.11% pebble [kernel.kallsyms] [k] generic_file_read_iter
+ 2.02% pebble pebble [.] rocksdb::DataBlockIter::Next
+ 1.67% pebble pebble [.] rocksdb::IndexBlockIter::Seek
+ 1.53% pebble [kernel.kallsyms] [k] do_syscall_64
+ 1.46% pebble pebble [.] rocksdb::DBIter::FindNextUserEntryInternal
+ 1.43% pebble [kernel.kallsyms] [k] __radix_tree_lookup
+ 1.37% pebble pebble [.] runtime.reentersyscall
+ 1.27% pebble pebble [.] rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::FindGreaterOrEqual
+ 1.25% pebble pebble [.] runtime.cgoIsGoPointer
+ 1.12% pebble pebble [.] runtime.exitsyscallfast
+ 1.10% pebble pebble [.] rocksdb::DBIter::Next
+ 1.05% pebble pebble [.] rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindKeyForward
batching:
+ 19.13% pebble pebble [.] runtime.scanobject
+ 11.56% pebble [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 7.00% pebble pebble [.] runtime.greyobject
+ 5.33% pebble pebble [.] runtime.memmove
+ 4.52% pebble pebble [.] runtime.findObject
+ 4.08% pebble pebble [.] rocksdb::crc32c::crc32c_3way
+ 2.51% pebble libc-2.27.so [.] __memmove_avx_unaligned_erms
+ 1.71% pebble [kernel.kallsyms] [k] generic_file_read_iter
+ 1.70% pebble pebble [.] runtime.gcDrain
+ 1.66% pebble pebble [.] cockroach::DBComparator::Compare
+ 1.33% pebble libc-2.27.so [.] __memcmp_avx2_movbe
+ 1.33% pebble [kernel.kallsyms] [k] find_get_entry
+ 1.19% pebble pebble [.] rocksdb::MergingIterator::Next
Looks like GC overhead is the main extra cost in the batching variant. I wonder why.
This is super surprising. Thanks for the reproduction steps. I'm going to reproduce myself and see if anything jumps out at me.
One thing I noticed is the GC was running every 4ms. When GOGC=100, live heap size is ~5MB as in this benchmark, and each scan allocates ~100KB, we'd expect GC to kick in every ~50 scans. Throughput we're seeing is 8K QPS, which means 50 scans happen in ~6ms.
I guess this problem is amplified in the repro since the live heap is so small. But, I also wonder if there's something we should do to reuse allocations rather than creating a new one every scan.
I think there is a slight apples-to-oranges comparison going on. The iterator code path is not copying the keys and values, while the batch scan path is. The following approximately rectifies this:
diff --git a/cmd/pebble/ycsb.go b/cmd/pebble/ycsb.go
index d7915cd..77d8dd9 100644
--- a/cmd/pebble/ycsb.go
+++ b/cmd/pebble/ycsb.go
@@ -15,6 +15,7 @@ import (
"github.com/petermattis/pebble"
"github.com/petermattis/pebble/internal/ackseq"
+ "github.com/petermattis/pebble/internal/bytealloc"
"github.com/petermattis/pebble/internal/randvar"
"github.com/spf13/cobra"
"golang.org/x/exp/rand"
@@ -425,8 +426,11 @@ func (y *ycsb) scan(db DB, rng *rand.Rand) {
log.Fatal(err)
}
} else {
+ var data bytealloc.A
iter := db.NewIter(nil)
for valid := iter.SeekGE([]byte(key)); valid; valid = iter.Next() {
+ data, _ = data.Copy(iter.Key())
+ data, _ = data.Copy(iter.Value())
i++
if i >= int(count) {
break
I did a quick experiment and this seemed to make the batch scan always faster.
One thing I noticed is the GC was running every 4ms. When GOGC=100, live heap size is ~5MB as in this benchmark, and each scan allocates ~100KB, we'd expect GC to kick in every ~50 scans. Throughput we're seeing is 8K QPS, which means 50 scans happen in ~6ms.
I guess this problem is amplified in the repro since the live heap is so small. But, I also wonder if there's something we should do to reuse allocations rather than creating a new one every scan.
This would be possible in the benchmark scenario, but hard in CockroachDB. The problem is that we're passing the scanned data as an RPC response. When that response is actually marshaled and sent is a bit tricky to determine.
Does the iterator's client need to get a copy of the whole range, or can they hold an iterator and look at elements one by one? I thought the copy was done for passing data over the CGo barrier; I did not know the client required it.
EDIT: Never mind, your next comment about needing to pass the scanned data as an RPC response answers my question. Then I should add that allocation logic to pebble ycsb both when measuring RocksDB and Pebble.
Does the iterator's client need to get a copy of the whole range, or can they hold an iterator and look at elements one by one? I thought the copy was done for passing data over the CGo barrier; I did not know the client required it.
For MVCCScan, we're storing the scanned data in an RPC response. That requires a copy of the data.
Here are numbers with the patch above for 1024 byte value sizes. No-batch:
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 48.7s 100000 2052.9 205286.2 0.5 0.2 1.8 7.3 71.3
Batching (a.k.a MVCCScan):
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 18.9s 100000 5302.8 530275.5 0.2 0.2 0.3 0.6 60.8
EDIT: Never mind, your next comment about needing to pass the scanned data as an RPC response answers my question. Then I should add that allocation logic to pebble ycsb both when measuring RocksDB and Pebble.
Yep. Note mvccForwardScan in cmd/pebble/mvcc.go already did that.
Here's what I see with 256 byte values:
RocksDB iter
./pebble ycsb bench --use-rocksdb --workload scan=100 --num-ops 1000000 --duration 0s --scans 100
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 130.1s 1000000 7686.9 768685.6 0.1 0.1 0.2 0.3 1.8
RocksDB MVCCScan
./pebble ycsb bench --use-rocksdb --use-batch-scan --workload scan=100 --num-ops 1000000 --duration 0s --scans 100
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 97.0s 1000000 10304.8 1030480.1 0.1 0.1 0.2 0.2 7.6
Pebble iter
./pebble ycsb bench --workload scan=100 --num-ops 1000000 --duration 0s --scans 100
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 30.6s 1000000 32642.0 3264198.3 0.0 0.0 0.0 0.1 5.8
This is on my laptop. Not sure if that matters. I had to specify a Merger for pebble to be ok with opening the RocksDB-created database:
Merger: &pebble.Merger{
Merge: pebble.DefaultMerger.Merge,
Name: "cockroach_merge_operator",
},
Here are some more runs with different scan sizes on my laptop:
RocksDB MVCCScan
./pebble ycsb bench --use-rocksdb --use-batch-scan --workload scan=100 --num-ops 1000000 --duration 0s --scans 1000
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 429.4s 1000000 2328.9 2328875.0 0.4 0.4 0.6 0.8 14.2
Pebble iter
./pebble ycsb bench --workload scan=100 --num-ops 1000000 --duration 0s --scans 1000
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 220.8s 1000000 4528.1 4528089.6 0.2 0.2 0.3 0.5 11.0
RocksDB MVCCScan
./pebble ycsb bench --use-rocksdb --use-batch-scan --workload scan=100 --num-ops 1000000 --duration 0s --scans 10
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 60.2s 1000000 16609.8 166097.7 0.1 0.1 0.1 0.1 1.6
Pebble iter
./pebble ycsb bench --workload scan=100 --num-ops 1000000 --duration 0s --scans 10
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 12.9s 1000000 77232.4 772323.5 0.0 0.0 0.0 0.0 7.9
RocksDB MVCCScan
./pebble ycsb bench --use-rocksdb --use-batch-scan --workload scan=100 --num-ops 1000000 --duration 0s --scans 1
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 42.4s 1000000 23594.0 23594.0 0.0 0.0 0.1 0.1 1.0
Pebble iter
./pebble ycsb bench --workload scan=100 --num-ops 1000000 --duration 0s --scans 1
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 11.3s 1000000 88818.6 88818.6 0.0 0.0 0.0 0.0 3.1
Note that MVCCScan has to do more than simply iterating over the keys, so this comparison to Pebble isn't quite fair.
The above numbers as a chart:
[Screen Shot 2019-06-09 at 4 01 06 PM] https://user-images.githubusercontent.com/6201335/59163485-cf29d780-8acf-11e9-88a8-bef78c4215a7.png
[Screen Shot 2019-06-09 at 4 05 03 PM] https://user-images.githubusercontent.com/6201335/59163530-5ecf8600-8ad0-11e9-9da8-c9397fab60b8.png
The first chart is showing ops/sec. The second keys/sec (ops/sec * scan-size).
Nice. I was seeing similarly great results when not copying into a bytealloc.A in the pebble scan. To clarify, do your results copy into a bytealloc.A? I haven't had a chance to try again after learning we need to make that copy (on a plane now).
Oh also since it looks like you're using my experiment branch - I noticed rocksdb is not using block cache and pebble is using 1GB.
Yes, this was using a copy for Pebble. The lack of block cache in RocksDB would definitely affect the smaller scan sizes. Let me take a look.
Unsurprisingly, RocksDB looks a lot better when configured to use a cache:
Here is a comparison where RocksDB (batch) is the baseline, showing the speed-up / slow-down for Pebble and RocksDB (iter) performance:
@ajkr The experiment methodology has a problem, due to a confusing aspect of the pebble ycsb command. When running with --workload scan=100, the start key for the scan is chosen from a zipf distribution in the range [1,<initial-keys>]. The <initial-keys> value is 10000 (the --initial-keys flag). If we bump up that flag, the scan=100 workload will rewrite that many keys. The effect this was having was that 10000 keys were being inserted before starting the scan workload, causing a memtable to be populated. I'm not sure what effect this will have on the performance numbers, but I'm guessing this wasn't what was intended.
OK, the RocksDB upgrade is done, now back to this. There are a couple of things that have been stuck in my mind about this comparison for the past few weeks.
(1) When benchmarking RocksDB the Go heap is tiny (5-10MB), which, with GOGC=100, causes GC to run very frequently. I wonder what a realistic live heap size is; we could adjust GOGC upwards to simulate it.
(2) The amount of data copying seems excessive. I will try prototyping some things to see if it's avoidable. I think it'd be great if we can not allocate any extra memory for keys/values and use PinnableSlice to read keys/values from the raw block cache memory, even from Go. This may be difficult.
The amount of data copying seems excessive. I will try prototyping some things to see if it's avoidable. I think it'd be great if we can not allocate any extra memory for keys/values and use PinnableSlice to read keys/values from the raw block cache memory, even from Go. This may be difficult.
I've looked at this in the past. This is extremely difficult to do because we need to pass pointers for each key/value, which is itself expensive. Happy to hear a new idea that makes this feasible, though.
I don't have a new idea but was assuming passing pointers is cheaper than passing whole keys/values. Maybe I'm missing something; let's see.
I don't have a new idea but was assuming passing pointers is cheaper than passing whole keys/values. Maybe I'm missing something; let's see.
If you're scanning 1000 key/values, you have to pass some structure from Go to C that can hold those 2x1000 pointers. After the Cgo call, you need to copy the data into Go memory to be sent back in the RPC response. Finally you have to make another Cgo call to release the C memory (i.e. the PinnableSlices).
Q: Does RocksDB allow pinning the memory for a key without copying? With prefix compression, I would think that a copy of the key would be needed one way or another.
I don't have a new idea but was assuming passing pointers is cheaper than passing whole keys/values. Maybe I'm missing something; let's see.
If you're scanning 1000 key/values, you have to pass some structure from Go to C that can hold those 2x1000 pointers. After the Cgo call, you need to copy the data into Go memory to be sent back in the RPC response. Finally you have to make another Cgo call to release the C memory (i.e. the PinnableSlices).
I don't understand protobuf or Go well enough right now. I have some vision of setting pointers into pinned block cache blocks in the protobuf response. Then I envision the serialization seeing those pointers and passing the pointed-to buffers directly into send(). That way there's no copying out of block cache. I'm probably about to learn many reasons why that's impossible.
EDIT: I do see the need for a Go allocation for the pointers, but I do not see the need for a copy of the data into Go memory.
Q: Does RocksDB allow pinning the memory for a key without copying? With prefix compression, I would think that a copy of the key would be needed one way or another.
Yes, that is true. ReadOptions::pin_data does not pin keys when delta encoding is enabled. So it'd require either turning it off or only using pointers to values.
EDIT: I do see the need for a Go allocation for the pointers, but I do not see the need for a copy of the data into Go memory.
The underlying C++ memory needs to be released. When does that release happen? Because Go is garbage collected we don't have a good signal for when it is safe to do so. Using a finalizer is not a good option as finalizers are not guaranteed to run in any specific time frame. And the running of finalizers is also done by a single goroutine. Having that goroutine make calls into C++ to free the memory seems like we would be venturing into unexplored territory.
Right, that is another unsolved problem. It gets worse because with ReadOptions::pin_data the pinned keys'/values' lifetime is tied to the iterator's lifetime. So could MVCCScan create a new iterator each time whose lifetime somehow ends after the protobuf response gets serialized? And if so does it improve performance? Those are questions I'm hoping to try answering.
Right, that is another unsolved problem. It gets worse because with ReadOptions::pin_data the pinned keys'/values' lifetime is tied to the iterator's lifetime. So could MVCCScan create a new iterator each time whose lifetime somehow ends after the protobuf response gets serialized? And if so does it improve performance? Those are questions I'm hoping to try answering.
To some degree, not needing to answer these questions is one of Pebble's strengths. It is interesting that the prior work on moving MVCCScan to C++ provided such a large improvement that it is still visible in the perf numbers. That work took ~2 months (from start to productionized). The initial experimentation took ~1 month. You might expect something similar here.
Got it. How about we put a time limit on how long I investigate whether it's possible, and if so, propose a plan? I don't want to spend three months on it either. But if we're using RocksDB in a suboptimal way and the effects are noticeable, I don't see why we shouldn't look into improving it.
My proposal would be to see what results a prototype can get by next Tuesday.
My proposal would be to see what results a prototype can get by next Tuesday.
Ack, that seems like a reasonable time bound.
PS The total time on the C++ version of MVCCScan was ~2 months: ~1 month experimenting + ~1 month productionizing.
Rather than trying to remove copies in the existing Cgo interface, we can make the comparison to Pebble more apples-to-apples by adding extra copies to the Pebble scan code. This is much easier to do. I made a quick attempt of that tonight. The results were a less than 3% slowdown, which might have been in the noise or I might have completely mucked something up. This was on my laptop.
That's a good idea. I won't have time to try for a few hours. But iirc the CPU profile showed a large percent of CPU time in allocation, copying, and GC, so I'm surprised that the potential gains by reducing these are limited to 3%.
There's a profile above that was taken with batching (https://github.com/cockroachdb/cockroach/issues/38116#issuecomment-500168760). More than 50% of CPU time spent in copying/allocation/GC - too large to ignore, IMO.
Code I'm using: https://github.com/ajkr/pebble/tree/experiment-scripts. Also a couple tiny changes in libroach: (1) set level0_file_num_compaction_trigger=1, and (2) treat '\0'-suffixed key as a value not a metadata key.
Here is a quick experiment showing the impact of GOGC on benchmark results.
setup: ./bin/roachprod create andrewk-bench --clouds aws -n 1 --aws-machine-type-ssd c5d.4xlarge --local-ssd --local-ssd-no-ext4-barrier
setup: ./pebble ycsb bench --wipe --use-rocksdb --workload insert=100 --values 64 --num-ops 10000000 --concurrency 256 --duration 0s --wait-compactions
bench: GOGC=X /usr/bin/time ./pebble ycsb bench [ --use-rocksdb ] [ --use-batch-scan ] --workload scan=100 --duration 2m --initial-keys 0 --prepopulated-keys 10000000 --keys uniform --concurrency 10 --batch 100
engine | GOGC | ops/sec | peak RSS (MB)
---|---|---|---
rocksdb (iter) | 100 | 61542 | 1250
rocksdb (iter) | 1000 | 85454 | 2144
rocksdb (batch) | 100 | 183294 | 1391
rocksdb (batch) | 1000 | 209515 | 2068
pebble | 100 | 205100 | 2012
To conclude: I think the benchmark results we've seen on RocksDB are affected by the tiny live heap size and default value of GOGC, which together cause unusually heavy GC activity. Perhaps a more apples-to-apples comparison could be setting the GOGC in the Rocks benchmark such that peak RSS matches the Pebble benchmark.
@ajkr This is interesting. Btw, https://github.com/petermattis/pebble/pull/172 should improve the Pebble numbers under concurrency. It has a dramatic effect on the 16 core machine I've been testing on. Not sure of the impact on a 4 core machine, which is what I believe you've been using.
More than 50% of CPU time spent in copying/allocation/GC - too large to ignore, IMO.
I haven't seen that in profiles I've looked at. Let me check that again.
Perhaps a more apples-to-apples comparison could be setting the GOGC in the Rocks benchmark such that peak RSS matches the Pebble benchmark.
I'm not sure matching peak RSS achieves an apples-to-apples comparison. Futzing with GOGC can definitely have an impact on performance, but such a massive difference in settings could actually tilt the benchmarks in favor of RocksDB. Let me think about this more, though.
--batch 100

The --batch flag doesn't affect the number of keys scanned. You need the --scans flag for that. I think that means you were testing with the default zipf:1-1000, which will mostly be small scans. Was that intentional?
treat '\0'-suffixed key as a value not a metadata key.
Btw, rather than doing this, we should append a valid timestamp to the key and then scan at a higher timestamp. Something like:
```go
func (y *ycsb) makeKey(keyNum uint64) []byte {
	key := make([]byte, 4, 24+10)
	copy(key, "user")
	key = strconv.AppendUint(key, y.hashKey(keyNum), 10)
	// Use the MVCC encoding for keys. This appends a timestamp with
	// walltime=1,logical=0.
	key = append(key, '\x00', '\x00', '\x00', '\x00', '\x00',
		'\x00', '\x00', '\x00', '\x01', '\x08')
	return key
}
```
This didn't affect perf, but avoids the need to change MVCCScan.
Attached are two profiles from running ./pebble ycsb /mnt/data1/bench --rocksdb --use-batch-scan --workload scan=100 --duration 30s --initial-keys 0 --prepopulated-keys 10000000 --keys uniform --scans 1 on a 16 core machine. The first (profile.n=1.pdf) was run with --concurrency=1 and the second (profile.n=10.pdf) was run with --concurrency=10. The time spent in C++ is 87% and 80% respectively. GC time does not appear to be significant. Neither does the copying that takes place in Go. That copying takes place in copyFromSliceVector:
```
   .      .   2480:   // Interpret the C pointer as a pointer to a Go array, then slice.
   .      .   2481:   slices := (*[1 << 20]C.DBSlice)(unsafe.Pointer(bufs))[:len:len]
   .      .   2482:   neededBytes := 0
10ms   10ms   2483:   for i := range slices {
   .      .   2484:           neededBytes += int(slices[i].len)
   .      .   2485:   }
   .  350ms   2486:   data := nonZeroingMakeByteSlice(neededBytes)[:0]
10ms   10ms   2487:   for i := range slices {
   .      .   2488:           data = append(data, cSliceToUnsafeGoBytes(slices[i])...)
   .      .   2489:   }
   .      .   2490:   return data
   .      .   2491:}
```
Note that the profile was gathered over 30s, so these timings are completely inconsequential. Note too that this was for a scan size of 1. I'm not terribly surprised by this as the data size being copied is ~100 bytes. This is copyFromSliceVector with a scan size of 1000:
```
   .      .   2480:   // Interpret the C pointer as a pointer to a Go array, then slice.
   .      .   2481:   slices := (*[1 << 20]C.DBSlice)(unsafe.Pointer(bufs))[:len:len]
   .      .   2482:   neededBytes := 0
10ms   10ms   2483:   for i := range slices {
   .      .   2484:           neededBytes += int(slices[i].len)
   .      .   2485:   }
   .  260ms   2486:   data := nonZeroingMakeByteSlice(neededBytes)[:0]
10ms   10ms   2487:   for i := range slices {
   .  720ms   2488:           data = append(data, cSliceToUnsafeGoBytes(slices[i])...)
   .      .   2489:   }
   .      .   2490:   return data
   .      .   2491:}
```
Now we actually see the copy showing up, but overall this routine is consuming ~1s of CPU out of a 30s run. I'd expect the copying of the key/values on the C++ side to have a similar impact.
Really excited about these experiments you're doing. I'd like to put in a plug for making sure that the scenarios under test include some that are similar to those we get when doing large table scans in CockroachDB.
The maximum number of keys we read at a time is 10,000. If we assume a row size of, say, 100 bytes (16 byte uuid, a couple 8 byte ints, and then a string column or two), we're looking at ~ 1 MB scan all told. Making sure we're testing scans of that magnitude would go a long way toward convincingly creating a comparison between the two storage engines that would hold up to the real world scenarios that the SQL execution engine cares about (at least for workloads that aren't completely pure OLTP).
Attached are two profiles from running ./pebble ycsb /mnt/data1/bench --rocksdb --use-batch-scan --workload scan=100 --duration 30s --initial-keys 0 --prepopulated-keys 10000000 --keys uniform --scans 1 on a 16 core machine. The first (profile.n=1.pdf) was run with --concurrency=1 and the second (profile.n=10.pdf) was run with --concurrency=10. The time spent in C++ is 87% and 80% respectively. GC time does not appear to be significant. Neither does the copying that takes place in Go. That copying takes place in copyFromSliceVector:
Unfortunately I don't have time to collect another profile using perf right now but will do so later (there is one above but it's a bit stale at this point). But even lacking that, does the fact that tuning GOGC to run less frequently caused a 10+% increase in ops/sec imply that GC is significant?
The --batch flag doesn't affect the number of keys scanned. You need the --scans flag for that. I think that means you were testing with the default zipf:1-1000 which will mostly be small scans. Was that intentional?
Right, I fixed that and set --scans 100 on the server but forgot to copy/paste the changed command here.
Unfortunately I don't have time to collect another profile using perf right now but will do so later (there is one above but it's a bit stale at this point). But even lacking that, does the fact that tuning GOGC to run less frequently caused a 10+% increase in ops/sec imply that GC is significant?
Yes, that implies that GC is significant. I've seen a similar level of perf improvement when running CockroachDB with a larger GOGC setting, yet we can't run like that in production. I'm also not seeing the same level of improvement when running on 16 core machines. I might try testing on 4 core machines and see if I can notice this as well. Something I also looked at was running with GODEBUG=gctrace=1, and the resulting log messages indicated that GC was a small single-digit percentage of time.
Btw, it is interesting that GOGC is helping the RocksDB numbers. My intuition is that the GC would be penalizing Pebble as it has so much more memory in the Go heap which will make GC more expensive.
My proposal would be to see what results a prototype can get by next Tuesday.
Ack, that seems like a reasonable time bound.
PS: The total time spent on the C++ version of MVCCScan was ~2 months: ~1 month experimenting + ~1 month productionizing.
Time is up. I did not get a working prototype in Cockroach. I'll write down what I learned and tried anyway, in case we decide to pick this up again.
To serve an MVCCScan, the following copies/allocations are required (ignoring those beneath block cache inside the storage engine):

1. When block_restart_interval > 1, the BlockIter uses IterKey::TrimAppend() for reading a prefix-encoded key. It copies part of the key and reuses an allocation, enlarging it when needed.
2. When block_restart_interval > 1 or !ReadOptions::pin_data, DBIter copies the current key into an IterKey before advancing its internal iterator. Again this reuses a buffer, enlarging it when needed.
3. MVCCScan() creates a chunkedBuffer as it scans, copying each key and value into it. chunkedBuffer has a nice property that it can be expanded without moving existing data.
4. rocksDBIterator::MVCCScan() copies all the chunkedBuffer data into a Go-allocated buffer.
5. Marshal()ing creates a buffer for the full response and copies the above buffer into it (the mechanism differs depending on ScanFormat, but either way essentially the same amount of data is copied).

So it seems like a lot, and the Go allocations were particularly inconvenient due to the unexpected GC assist they incurred. In an attempt to eliminate all the above allocations/copies, I wanted to try passing around pointers to keys/values in block cache all the way down to writev(), such that key/value data is only read and coalesced into a response in the kernel. That'd require the following changes, only some of which are possible today:
1. Set block_restart_interval = 1. That disables key prefix compression, increasing the on-disk size of sstables and the in-memory size of blocks in block cache.
2. Set ReadOptions::pin_data = true. That pins all data blocks that are accessed during an iterator's lifetime such that we can hold pointers to the keys/values even as the iterator moves. This would prevent using a pool of iterators, as otherwise the pinned block count could grow infinitely.
3. Zero-copy response serialization, i.e., building the response directly from pointers to the pinned keys/values rather than marshaling them into a new buffer.

The first two changes are easy for prototyping, although their consequences are worrisome. The third, however, I could only find in Cap'nproto in C++. I did use it to write a toy C++ client/server for scanning RocksDB data, and verified it really does pass pointers to keys/values in block cache all the way down to writev(). That code can be found here: https://gist.github.com/ajkr/a0b7d2a8b7cf1ad24d65338f04731a30
I can think of a few possible next steps if we decide to pick this up again, such as prototyping a zero-copy MVCCScan, either using RocksDB or Pebble.

The original goal of this task is resolved, so closing.
A few months ago I ran some experiments where the low-level iterator interface that crosses the CGo barrier for every Next() performed better than expected compared to the batch scan API (MVCCScan). I didn't do a good job of recording the methodology or the results, so this has just been stuck in the back of my mind for a while.

Recently I had a chance to try again. With a fresh experiment setup and measurements, the low-level iterator interface is again showing higher throughput in most cases, often significantly. In this issue, I will record all the repro steps and investigation.
Experiment branch: https://github.com/ajkr/pebble/tree/experiment-scripts
Build:
1. In the cockroach dir, apply the below patch. This is necessary to treat the zero-timestamp keys, which pebble ycsb generates, as regular keys rather than intents. I am guessing that this hack doesn't have any outcome-altering side effects, but please let me know if it looks too suspicious. We can always modify pebble ycsb to generate keys with real timestamps suffixed.
2. make libroach librocksdb.
3. go build ./cmd/pebble.
Experiment:
go run ./cmd/pebble ycsb ./tmp-rocks --wipe --use-rocksdb --workload insert=100 --values 256 --num-ops 10000000 --concurrency 256 --duration 0s --wait-compactions
go run ./cmd/pebble ycsb ./tmp-rocks --use-rocksdb --workload scan=100 --num-ops 1000000 --duration 0s --scans 100
go run ./cmd/pebble ycsb ./tmp-rocks --use-rocksdb --use-batch-scan --workload scan=100 --num-ops 1000000 --duration 0s --scans 100
Another minor patch that gives more stable results is to set level0_file_num_compaction_trigger=1 in libroach. By default it's two, which causes each trial run to alternate between having 0 L0 sstables and 1 L1 sstable due to the flush that happens during recovery.

Results:
There are two scenarios where MVCCScan's throughput was higher, and seven scenarios where the low-level iterator's throughput was higher.

The above results are from one trial only. For a handful of them I tried running a second time to check stability. I always observed very similar results.