ajkr closed this issue 5 years ago.
Focusing on a scenario with a very wide perf gap (value size 1024, scan size 100), here are some performance measurements.
CPU profiles (functions consuming >1% of cycles, excluding child calls). No batching:
+ 24.72% pebble [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 7.62% pebble pebble [.] rocksdb::crc32c::crc32c_3way
+ 3.58% pebble [kernel.kallsyms] [k] find_get_entry
+ 3.07% pebble pebble [.] cockroach::DBComparator::Compare
+ 2.45% pebble pebble [.] runtime.casgstatus
+ 2.32% pebble pebble [.] rocksdb::MergingIterator::Next
+ 2.18% pebble libc-2.27.so [.] __memcmp_avx2_movbe
+ 2.11% pebble [kernel.kallsyms] [k] generic_file_read_iter
+ 2.02% pebble pebble [.] rocksdb::DataBlockIter::Next
+ 1.67% pebble pebble [.] rocksdb::IndexBlockIter::Seek
+ 1.53% pebble [kernel.kallsyms] [k] do_syscall_64
+ 1.46% pebble pebble [.] rocksdb::DBIter::FindNextUserEntryInternal
+ 1.43% pebble [kernel.kallsyms] [k] __radix_tree_lookup
+ 1.37% pebble pebble [.] runtime.reentersyscall
+ 1.27% pebble pebble [.] rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::FindGreaterOrEqual
+ 1.25% pebble pebble [.] runtime.cgoIsGoPointer
+ 1.12% pebble pebble [.] runtime.exitsyscallfast
+ 1.10% pebble pebble [.] rocksdb::DBIter::Next
+ 1.05% pebble pebble [.] rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindKeyForward
batching:
+ 19.13% pebble pebble [.] runtime.scanobject
+ 11.56% pebble [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 7.00% pebble pebble [.] runtime.greyobject
+ 5.33% pebble pebble [.] runtime.memmove
+ 4.52% pebble pebble [.] runtime.findObject
+ 4.08% pebble pebble [.] rocksdb::crc32c::crc32c_3way
+ 2.51% pebble libc-2.27.so [.] __memmove_avx_unaligned_erms
+ 1.71% pebble [kernel.kallsyms] [k] generic_file_read_iter
+ 1.70% pebble pebble [.] runtime.gcDrain
+ 1.66% pebble pebble [.] cockroach::DBComparator::Compare
+ 1.33% pebble libc-2.27.so [.] __memcmp_avx2_movbe
+ 1.33% pebble [kernel.kallsyms] [k] find_get_entry
+ 1.19% pebble pebble [.] rocksdb::MergingIterator::Next
Looks like GC overhead is the main extra cost in the batching variant. I wonder why.
This is super surprising. Thanks for the reproduction steps. I'm going to reproduce myself and see if anything jumps out at me.
One thing I noticed is the GC was running every 4ms. When GOGC=100, live heap size is ~5MB as in this benchmark, and each scan allocates ~100KB, we'd expect GC to kick in every ~50 scans. Throughput we're seeing is 8K QPS, which means 50 scans happen in ~6ms.
I guess this problem is amplified in the repro since the live heap is so small. But, I also wonder if there's something we should do to reuse allocations rather than creating a new one every scan.
I think there is a slight apples-to-oranges comparison going on. The iterator code path is not copying the keys and values, while the batch scan path is. The following approximately rectifies this:
diff --git a/cmd/pebble/ycsb.go b/cmd/pebble/ycsb.go
index d7915cd..77d8dd9 100644
--- a/cmd/pebble/ycsb.go
+++ b/cmd/pebble/ycsb.go
@@ -15,6 +15,7 @@ import (
"github.com/petermattis/pebble"
"github.com/petermattis/pebble/internal/ackseq"
+ "github.com/petermattis/pebble/internal/bytealloc"
"github.com/petermattis/pebble/internal/randvar"
"github.com/spf13/cobra"
"golang.org/x/exp/rand"
@@ -425,8 +426,11 @@ func (y *ycsb) scan(db DB, rng *rand.Rand) {
log.Fatal(err)
}
} else {
+ var data bytealloc.A
iter := db.NewIter(nil)
for valid := iter.SeekGE([]byte(key)); valid; valid = iter.Next() {
+ data, _ = data.Copy(iter.Key())
+ data, _ = data.Copy(iter.Value())
i++
if i >= int(count) {
break
I did a quick experiment and this seemed to make the batch scan always faster.
One thing I noticed is the GC was running every 4ms. When GOGC=100, live heap size is ~5MB as in this benchmark, and each scan allocates ~100KB, we'd expect GC to kick in every ~50 scans. Throughput we're seeing is 8K QPS, which means 50 scans happen in ~6ms.
I guess this problem is amplified in the repro since the live heap is so small. But, I also wonder if there's something we should do to reuse allocations rather than creating a new one every scan.
This would be possible in the benchmark scenario, but hard in CockroachDB. The problem is that we're passing the scanned data as an RPC response. When that response is actually marshaled and sent is a bit tricky to determine.
Does the iterator's client need to get a copy of the whole range, or can they hold an iterator and look at elements one by one? I thought the copy was done for passing data over the CGo barrier; I did not know the client required it.
EDIT: Never mind, your next comment about needing to pass the scanned data as an RPC response answers my question. Then I should add that allocation logic to pebble ycsb both when measuring RocksDB and Pebble.
Does the iterator's client need to get a copy of the whole range, or can they hold an iterator and look at elements one by one? I thought the copy was done for passing data over the CGo barrier; I did not know the client required it.
For MVCCScan, we're storing the scanned data in an RPC response. That requires a copy of the data.
Here are numbers with the patch above for 1024 byte value sizes. No-batch:
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 48.7s 100000 2052.9 205286.2 0.5 0.2 1.8 7.3 71.3
Batching (a.k.a MVCCScan):
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 18.9s 100000 5302.8 530275.5 0.2 0.2 0.3 0.6 60.8
EDIT: Never mind, your next comment about needing to pass the scanned data as an RPC response answers my question. Then I should add that allocation logic to pebble ycsb both when measuring RocksDB and Pebble.
Yep. Note mvccForwardScan in cmd/pebble/mvcc.go already did that.
Here's what I see with 256 byte values:
RocksDB iter
./pebble ycsb bench --use-rocksdb --workload scan=100 --num-ops 1000000 --duration 0s --scans 100
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 130.1s 1000000 7686.9 768685.6 0.1 0.1 0.2 0.3 1.8
RocksDB MVCCScan
./pebble ycsb bench --use-rocksdb --use-batch-scan --workload scan=100 --num-ops 1000000 --duration 0s --scans 100
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 97.0s 1000000 10304.8 1030480.1 0.1 0.1 0.2 0.2 7.6
Pebble iter
./pebble ycsb bench --workload scan=100 --num-ops 1000000 --duration 0s --scans 100
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 30.6s 1000000 32642.0 3264198.3 0.0 0.0 0.0 0.1 5.8
This is on my laptop. Not sure if that matters. I had to specify a Merger for pebble to be ok with opening the RocksDB-created database:
Merger: &pebble.Merger{
Merge: pebble.DefaultMerger.Merge,
Name: "cockroach_merge_operator",
},
Here are some more runs with different scan sizes on my laptop:
RocksDB MVCCScan
./pebble ycsb bench --use-rocksdb --use-batch-scan --workload scan=100 --num-ops 1000000 --duration 0s --scans 1000
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 429.4s 1000000 2328.9 2328875.0 0.4 0.4 0.6 0.8 14.2
Pebble iter
./pebble ycsb bench --workload scan=100 --num-ops 1000000 --duration 0s --scans 1000
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 220.8s 1000000 4528.1 4528089.6 0.2 0.2 0.3 0.5 11.0
RocksDB MVCCScan
./pebble ycsb bench --use-rocksdb --use-batch-scan --workload scan=100 --num-ops 1000000 --duration 0s --scans 10
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 60.2s 1000000 16609.8 166097.7 0.1 0.1 0.1 0.1 1.6
Pebble iter
./pebble ycsb bench --workload scan=100 --num-ops 1000000 --duration 0s --scans 10
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 12.9s 1000000 77232.4 772323.5 0.0 0.0 0.0 0.0 7.9
RocksDB MVCCScan
./pebble ycsb bench --use-rocksdb --use-batch-scan --workload scan=100 --num-ops 1000000 --duration 0s --scans 1
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 42.4s 1000000 23594.0 23594.0 0.0 0.0 0.1 0.1 1.0
Pebble iter
./pebble ycsb bench --workload scan=100 --num-ops 1000000 --duration 0s --scans 1
____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
scan_100 11.3s 1000000 88818.6 88818.6 0.0 0.0 0.0 0.0 3.1
Note that MVCCScan has to do more than simply iterating over the keys, so this comparison to Pebble isn't quite fair.
The above numbers as a chart:
[Screen Shot 2019-06-09 at 4 01 06 PM] https://user-images.githubusercontent.com/6201335/59163485-cf29d780-8acf-11e9-88a8-bef78c4215a7.png
[Screen Shot 2019-06-09 at 4 05 03 PM] https://user-images.githubusercontent.com/6201335/59163530-5ecf8600-8ad0-11e9-9da8-c9397fab60b8.png
The first chart is showing ops/sec. The second keys/sec (ops/sec * scan-size).
Nice. I was seeing similarly great results when not copying into a bytealloc.A in the pebble scan. To clarify, do your results copy into a bytealloc.A? I haven't had a chance to try again after learning we need to make that copy (on a plane now).
Oh also since it looks like you're using my experiment branch - I noticed rocksdb is not using block cache and pebble is using 1GB.
Yes, this was using a copy for Pebble. The lack of block cache in RocksDB would definitely affect the smaller scan sizes. Let me take a look.
Unsurprisingly, RocksDB looks a lot better when configured to use a cache:
Here is a comparison where RocksDB (batch) is the baseline, showing the speed-up / slow-down for Pebble and RocksDB (iter) performance:
@ajkr The experiment methodology has a problem, due to a confusing aspect of the pebble ycsb command. When running with --workload scan=100, the start key for the scan is chosen from a zipf distribution in the range [1,<initial-keys>]. The <initial-keys> value is 10000 (the --initial-keys flag). If we bump up that flag, the scan=100 workload will rewrite that many keys. The effect this was having was that 10000 keys were being inserted before starting the scan workload, causing a memtable to be populated. I'm not sure what effect this will have on the performance numbers, but I'm guessing this wasn't what was intended.
OK, the RocksDB upgrade is done, now back to this. There are a couple of things that have been stuck in my mind about this comparison for the past few weeks.
(1) When benchmarking RocksDB the Go heap is tiny (5-10MB), which, with GOGC=100, causes GC to run very frequently. I wonder what a realistic live heap size is; we could adjust GOGC upwards to simulate it.
(2) The amount of data copying seems excessive. I will try prototyping some things to see if it's avoidable. I think it'd be great if we can not allocate any extra memory for keys/values and use PinnableSlice to read keys/values from the raw block cache memory, even from Go. This may be difficult.
The amount of data copying seems excessive. I will try prototyping some things to see if it's avoidable. I think it'd be great if we can not allocate any extra memory for keys/values and use PinnableSlice to read keys/values from the raw block cache memory, even from Go. This may be difficult.
I've looked at this in the past. This is extremely difficult to do because we need to pass pointers for each key/value, which is itself expensive. Happy to hear a new idea that makes this feasible, though.
I don't have a new idea but was assuming passing pointers is cheaper than passing whole keys/values. Maybe I'm missing something; let's see.
I don't have a new idea but was assuming passing pointers is cheaper than passing whole keys/values. Maybe I'm missing something; let's see.
If you're scanning 1000 key/values, you have to pass some structure from Go to C that can hold those 2x1000 pointers. After the Cgo call, you need to copy the data into Go memory to be sent back in the RPC response. Finally you have to make another Cgo call to release the C memory (i.e. the PinnableSlices).
Q: Does RocksDB allow pinning the memory for a key without copying? With prefix compression, I would think that a copy of the key would be needed one way or another.
I don't have a new idea but was assuming passing pointers is cheaper than passing whole keys/values. Maybe I'm missing something; let's see.
If you're scanning 1000 key/values, you have to pass some structure from Go to C that can hold those 2x1000 pointers. After the Cgo call, you need to copy the data into Go memory to be sent back in the RPC response. Finally you have to make another Cgo call to release the C memory (i.e. the PinnableSlices).
I don't understand protobuf or Go well enough right now. I have some vision of setting pointers into pinned block cache blocks in the protobuf response. Then I envision the serialization seeing those pointers and passing the pointed-to buffers directly into send(). That way there's no copying out of block cache. I'm probably about to learn many reasons why that's impossible.
EDIT: I do see the need for a Go allocation for the pointers, but I do not see the need for a copy of the data into Go memory.
Q: Does RocksDB allow pinning the memory for a key without copying? With prefix compression, I would think that a copy of the key would be needed one way or another.
Yes, that is true. ReadOptions::pin_data does not pin keys when delta encoding is enabled. So it'd require either turning it off or only using pointers to values.
EDIT: I do see the need for a Go allocation for the pointers, but I do not see the need for a copy of the data into Go memory.
The underlying C++ memory needs to be released. When does that release happen? Because Go is garbage collected we don't have a good signal for when it is safe to do so. Using a finalizer is not a good option as finalizers are not guaranteed to run in any specific time frame. And the running of finalizers is also done by a single goroutine. Having that goroutine make calls into C++ to free the memory seems like we would be venturing into unexplored territory.
Right, that is another unsolved problem. It gets worse because with ReadOptions::pin_data the pinned keys'/values' lifetime is tied to the iterator's lifetime. So could MVCCScan create a new iterator each time whose lifetime somehow ends after the protobuf response gets serialized? And if so does it improve performance? Those are questions I'm hoping to try answering.
Right, that is another unsolved problem. It gets worse because with ReadOptions::pin_data the pinned keys'/values' lifetime is tied to the iterator's lifetime. So could MVCCScan create a new iterator each time whose lifetime somehow ends after the protobuf response gets serialized? And if so does it improve performance? Those are questions I'm hoping to try answering.
To some degree, not needing to answer these questions is one of Pebble's strengths. It is interesting that the prior work on moving MVCCScan to C++ provided such a large improvement that it is still visible in the perf numbers. That work took ~2 months (from start to productionized). The initial experimentation took ~1 month. You might expect something similar here.
Got it. How about we put a time limit on how long I investigate whether it's possible, and if so, propose a plan? I don't want to spend three months on it either. But if we're using RocksDB in a suboptimal way and the effects are noticeable, I don't see why we shouldn't look into improving it.
My proposal would be to see what results a prototype can get by next Tuesday.
My proposal would be to see what results a prototype can get by next Tuesday.
Ack, that seems like a reasonable time bound.
PS The total time on the C++ version of MVCCScan was ~2 months: ~1 month experimenting + ~1 month productionizing.
Rather than trying to remove copies in the existing Cgo interface, we can make the comparison to Pebble more apples-to-apples by adding extra copies to the Pebble scan code. This is much easier to do. I made a quick attempt of that tonight. The results were a less than 3% slowdown, which might have been in the noise or I might have completely mucked something up. This was on my laptop.
That's a good idea. I won't have time to try for a few hours. But iirc the CPU profile showed a large percent of CPU time in allocation, copying, and GC, so I'm surprised that the potential gains by reducing these are limited to 3%.
There's a profile above that was taken with batching (https://github.com/cockroachdb/cockroach/issues/38116#issuecomment-500168760). More than 50% of CPU time spent in copying/allocation/GC - too large to ignore, IMO.
Code I'm using: https://github.com/ajkr/pebble/tree/experiment-scripts. Also a couple tiny changes in libroach: (1) set level0_file_num_compaction_trigger=1, and (2) treat '\0'-suffixed key as a value not a metadata key.
Here is a quick experiment showing the impact of GOGC on benchmark results.
setup: ./bin/roachprod create andrewk-bench --clouds aws -n 1 --aws-machine-type-ssd c5d.4xlarge --local-ssd --local-ssd-no-ext4-barrier
setup: ./pebble ycsb bench --wipe --use-rocksdb --workload insert=100 --values 64 --num-ops 10000000 --concurrency 256 --duration 0s --wait-compactions
bench: GOGC=X /usr/bin/time ./pebble ycsb bench [ --use-rocksdb ] [ --use-batch-scan ] --workload scan=100 --duration 2m --initial-keys 0 --prepopulated-keys 10000000 --keys uniform --concurrency 10 --batch 100
engine | GOGC | ops/sec | peak RSS (MB)
---|---|---|---
rocksdb (iter) | 100 | 61542 | 1250
rocksdb (iter) | 1000 | 85454 | 2144
rocksdb (batch) | 100 | 183294 | 1391
rocksdb (batch) | 1000 | 209515 | 2068
pebble | 100 | 205100 | 2012
To conclude: I think the benchmark results we've seen on RocksDB are affected by the tiny live heap size and default value of GOGC, which together cause unusually heavy GC activity. Perhaps a more apples-to-apples comparison could be setting the GOGC in the Rocks benchmark such that peak RSS matches the Pebble benchmark.
@ajkr This is interesting. Btw, https://github.com/petermattis/pebble/pull/172 should improve the Pebble numbers under concurrency. It has a dramatic effect on the 16 core machine I've been testing on. Not sure of the impact on a 4 core machine, which is what I believe you've been using.
More than 50% of CPU time spent in copying/allocation/GC - too large to ignore, IMO.
I haven't seen that in profiles I've looked at. Let me check that again.
Perhaps a more apples-to-apples comparison could be setting the GOGC in the Rocks benchmark such that peak RSS matches the Pebble benchmark.
I'm not sure matching peak RSS achieves an apples-to-apples comparison. Futzing with GOGC can definitely have an impact on performance, but such a massive difference in settings could actually tilt the benchmarks in favor of RocksDB. Let me think about this more, though.
--batch 100

The --batch flag doesn't affect the number of keys scanned. You need the --scans flag for that. I think that means you were testing with the default zipf:1-1000, which will mostly be small scans. Was that intentional?
treat '\0'-suffixed key as a value not a metadata key.
Btw, rather than doing this, we should append a valid timestamp to the key and then scan at a higher timestamp. Something like:
```go
func (y *ycsb) makeKey(keyNum uint64) []byte {
	key := make([]byte, 4, 24+10)
	copy(key, "user")
	key = strconv.AppendUint(key, y.hashKey(keyNum), 10)
	// Use the MVCC encoding for keys. This appends a timestamp with
	// walltime=1,logical=0.
	key = append(key, '\x00', '\x00', '\x00', '\x00', '\x00',
		'\x00', '\x00', '\x00', '\x01', '\x08')
	return key
}
```
This didn't affect perf, but avoids the need to change MVCCScan.
Attached are two profiles from running ./pebble ycsb /mnt/data1/bench --rocksdb --use-batch-scan --workload scan=100 --duration 30s --initial-keys 0 --prepopulated-keys 10000000 --keys uniform --scans 1 on a 16 core machine. The first (profile.n=1.pdf) was run with --concurrency=1 and the second (profile.n=10.pdf) was run with --concurrency=10. The time spent in C++ is 87% and 80% respectively. GC time does not appear to be significant. Neither does the copying that takes place in Go. That copying takes place in copyFromSliceVector:
```
   .      .   2480:   // Interpret the C pointer as a pointer to a Go array, then slice.
   .      .   2481:   slices := (*[1 << 20]C.DBSlice)(unsafe.Pointer(bufs))[:len:len]
   .      .   2482:   neededBytes := 0
10ms   10ms   2483:   for i := range slices {
   .      .   2484:           neededBytes += int(slices[i].len)
   .      .   2485:   }
   .  350ms   2486:   data := nonZeroingMakeByteSlice(neededBytes)[:0]
10ms   10ms   2487:   for i := range slices {
   .      .   2488:           data = append(data, cSliceToUnsafeGoBytes(slices[i])...)
   .      .   2489:   }
   .      .   2490:   return data
   .      .   2491:}
```
Note that the profile was gathered over 30s, so these timings are completely inconsequential. Note too that this was for a scan size of 1. I'm not terribly surprised by this as the data size being copied is ~100 bytes. This is copyFromSliceVector with a scan size of 1000:
```
   .      .   2480:   // Interpret the C pointer as a pointer to a Go array, then slice.
   .      .   2481:   slices := (*[1 << 20]C.DBSlice)(unsafe.Pointer(bufs))[:len:len]
   .      .   2482:   neededBytes := 0
10ms   10ms   2483:   for i := range slices {
   .      .   2484:           neededBytes += int(slices[i].len)
   .      .   2485:   }
   .  260ms   2486:   data := nonZeroingMakeByteSlice(neededBytes)[:0]
10ms   10ms   2487:   for i := range slices {
   .  720ms   2488:           data = append(data, cSliceToUnsafeGoBytes(slices[i])...)
   .      .   2489:   }
   .      .   2490:   return data
   .      .   2491:}
```
Now we actually see the copy showing up, but overall this routine is consuming ~1s of CPU out of a 30s run. I'd expect the copying of the key/values on the C++ side to have a similar impact.
Really excited about these experiments you're doing. I'd like to put in a plug for making sure that the scenarios under test include some that are similar to those we get when doing large table scans in CockroachDB.
The maximum number of keys we read at a time is 10,000. If we assume a row size of, say, 100 bytes (16 byte uuid, a couple 8 byte ints, and then a string column or two), we're looking at ~ 1 MB scan all told. Making sure we're testing scans of that magnitude would go a long way toward convincingly creating a comparison between the two storage engines that would hold up to the real world scenarios that the SQL execution engine cares about (at least for workloads that aren't completely pure OLTP).
Attached are two profiles from running ./pebble ycsb /mnt/data1/bench --rocksdb --use-batch-scan --workload scan=100 --duration 30s --initial-keys 0 --prepopulated-keys 10000000 --keys uniform --scans 1 on a 16 core machine. The first (profile.n=1.pdf) was run with --concurrency=1 and the second (profile.n=10.pdf) was run with --concurrency=10. The time spent in C++ is 87% and 80% respectively. GC time does not appear to be significant. Neither does the copying that takes place in Go. That copying takes place in copyFromSliceVector:
Unfortunately I don't have time to collect another profile using perf right now but will do so later (there is one above but it's a bit stale at this point). But even lacking that, does the fact that tuning GOGC to run less frequently caused a 10+% increase in ops/sec imply that GC is significant?
The --batch flag doesn't affect the number of keys scanned. You need the --scans flag for that. I think that means you were testing with the default zipf:1-1000 which will mostly be small scans. Was that intentional?
Right, I fixed that and set --scans 100 on the server but forgot to copy/paste the changed command here.
Unfortunately I don't have time to collect another profile using perf right now but will do so later (there is one above but it's a bit stale at this point). But even lacking that, does the fact that tuning GOGC to run less frequently caused a 10+% increase in ops/sec imply that GC is significant?
Yes, that implies that GC is significant. I've seen a similar level of perf improvement when running CockroachDB with a larger GOGC setting, yet we can't run like that in production. I'm also not seeing the same level of improvement when running on 16 core machines. I might try testing on 4 core machines and see if I can notice this as well. Something I also looked at was running with GODEBUG=gctrace=1, and the resulting log messages indicated that GC was a small single-digit percentage of time.
Btw, it is interesting that GOGC is helping the RocksDB numbers. My intuition is that the GC would be penalizing Pebble as it has so much more memory in the Go heap which will make GC more expensive.
My proposal would be to see what results a prototype can get by next Tuesday.
Ack, that seems like a reasonable time bound.
PS: The total time spent on the C++ version of MVCCScan was ~2 months: ~1 month experimenting + ~1 month productionizing.
Time is up. I did not get a working prototype in Cockroach. I'll write down what I learned and tried anyway, in case we decide to pick this up again.
To serve an MVCCScan, the following copies/allocations are required (ignoring those beneath block cache inside the storage engine):

1. When block_restart_interval > 1, the BlockIter uses IterKey::TrimAppend() for reading a prefix-encoded key. It copies part of the key and reuses an allocation, enlarging it when needed.
2. When block_restart_interval > 1 or !ReadOptions::pin_data, DBIter copies the current key into an IterKey before advancing its internal iterator. Again this reuses a buffer, enlarging it when needed.
3. MVCCScan() creates a chunkedBuffer as it scans, copying each key and value into it. chunkedBuffer has a nice property that it can be expanded without moving existing data.
4. rocksDBIterator::MVCCScan() copies all the chunkedBuffer data into a Go-allocated buffer.
5. Marshal()ing creates a buffer for the full response and copies the above buffer into it (the mechanism differs depending on ScanFormat, but either way essentially the same amount of data is copied).

So it seems like a lot, and the Go allocations were particularly inconvenient due to the unexpected GC assist they incurred. In an attempt to eliminate all the above allocations/copies, I wanted to try passing around pointers to keys/values in block cache all the way down to writev(), such that key/value data is only read and coalesced into a response in the kernel. That'd require the following changes, only some of which are possible today:
1. Set block_restart_interval = 1. That disables key prefix compression, increasing the on-disk size of sstables and the in-memory size of blocks in block cache.
2. Set ReadOptions::pin_data = true. That pins all data blocks that are accessed during an iterator's lifetime such that we can hold pointers to the keys/values even as the iterator moves. This would prevent using a pool of iterators, as otherwise the pinned block count could grow infinitely.
3. Zero-copy response serialization, i.e., building the response directly from pointers to the pinned keys/values rather than marshaling them into a new buffer.

The first two changes are easy for prototyping, although their consequences are worrisome. The third, however, I could only find in Cap'nproto in C++. I did use it to write a toy C++ client/server for scanning RocksDB data, and verified it really does pass pointers to keys/values in block cache all the way down to writev(). That code can be found here: https://gist.github.com/ajkr/a0b7d2a8b7cf1ad24d65338f04731a30
I can think of a few possible next steps if we decide to pick this up again, such as prototyping a zero-copy MVCCScan, either using RocksDB or Pebble.

The original goal of this task is resolved, so closing.
A few months ago I ran some experiments where the low-level iterator interface that crosses the CGo barrier for every Next() performed better than expected compared to the batch scan API (MVCCScan). I didn't do a good job of recording the methodology or the results, so this has just been stuck in the back of my mind for a while.

Recently I had a chance to try again. With a fresh experiment setup and measurements, the low-level iterator interface is again showing higher throughput in most cases, often significantly. In this issue, I will record all the repro steps and investigation.
Experiment branch: https://github.com/ajkr/pebble/tree/experiment-scripts
Build:
1. In the cockroach dir, apply the below patch. This is necessary to treat the zero-timestamp keys, which pebble ycsb generates, as regular keys rather than intents. I am guessing that this hack doesn't have any outcome-altering side effects, but please let me know if it looks too suspicious. We can always modify pebble ycsb to generate keys with real timestamps suffixed.
2. make libroach librocksdb.
3. go build ./cmd/pebble.
Experiment:
go run ./cmd/pebble ycsb ./tmp-rocks --wipe --use-rocksdb --workload insert=100 --values 256 --num-ops 10000000 --concurrency 256 --duration 0s --wait-compactions
go run ./cmd/pebble ycsb ./tmp-rocks --use-rocksdb --workload scan=100 --num-ops 1000000 --duration 0s --scans 100
go run ./cmd/pebble ycsb ./tmp-rocks --use-rocksdb --use-batch-scan --workload scan=100 --num-ops 1000000 --duration 0s --scans 100
Another minor patch that gives more stable results is to set level0_file_num_compaction_trigger=1 in libroach. By default it's two, which causes each trial run to alternate between having 0 L0 sstables and 1 L1 sstable due to the flush that happens during recovery.

Results:
There are two scenarios where MVCCScan's throughput was higher, and seven scenarios where the low-level iterator's throughput was higher.

The above results are from one trial only. For a handful of them I tried running a second time to check stability. I always observed very similar results.