cockroachdb / cockroach

kv: triage microbenchmark regressions for v23.2 #114310

Closed: nvanbenschoten closed this issue 11 months ago

nvanbenschoten commented 1 year ago

From the 23.2 microbenchmark results: https://docs.google.com/spreadsheets/d/1l7yX9btMitDyZ1DLJQDaJVNoukpYbk2iu1tgTg1551Q/edit#gid=5

All relevant performance regressions are (at least) reflected in the sec/op metric. Some are also reflected in heap allocation metrics, which may explain the slowdown.

I've done my best to group related regressions to avoid redundant work.

pkg/kv/kvclient/kvcoord

pkg/kv/kvserver

pkg/kv/kvserver/batcheval

pkg/kv/kvserver/concurrency

pkg/kv/kvserver/spanlatch

pkg/kv/kvserver/logstore

pkg/kv/kvserver/raftentry

pkg/kv/kvserver/raftlog

pkg/kv/kvserver/tscache

pkg/kv/kvserver/concurrency and pkg/kv/kvserver/spanlatch

These packages are seeing regressions across a collection of btree benchmarks. If this is not noise (it probably is), the regressions likely share a single cause.


Tips for debugging

Benchdiff can be a useful tool for debugging. Install the tool, then follow these steps to compare two commits directly or to auto-bisect the regression.

# Setup steps:
export base_sha=f13c021b
export master_sha=6d4a8274
echo 'git clean -e benchdiff -e post-checkout.sh -e .bazelrc.user -e _bazel -e bin/.dev-status -e bin/dev-versions -dxff' >> post-checkout.sh
echo './dev generate' >> post-checkout.sh
echo './dev build short' >> post-checkout.sh

# Per-benchmark configuration.
export pkg_name='./pkg/kv/kvserver/spanlatch'
export test_name='BTreeDeleteInsert/count=16'
export threshold=0.2

# Single comparison between master and release-23.1:
benchdiff --new=${master_sha} --old=${base_sha} --post-checkout='bash post-checkout.sh' --run=${test_name} --count=10 ${pkg_name}
# OR auto-bisect regression:
rm -rf ./benchdiff/*
git bisect reset
git bisect start ${master_sha} ${base_sha}
git bisect run benchdiff --old=${base_sha} --post-checkout='bash post-checkout.sh' --run=${test_name} --count=10 --threshold=${threshold} ${pkg_name}
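
Once the auto-bisect converges, git prints the first bad commit. A quick way to inspect what it landed on, using standard git commands rather than anything benchdiff-specific:

```
# Run these while the bisect session is still active (before `git bisect reset`).
git bisect log                    # replay the recorded good/bad decisions
git show --stat refs/bisect/bad   # the first bad commit and the files it touched
git bisect reset                  # restore the original checkout when done
```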

Jira issue: CRDB-33448

miraradeva commented 1 year ago

I bisected RefreshRange/linear-keys first because it shows the largest regression (75% for allocs/op). I couldn't automatically bisect over the entire range of commits because we bumped the Go version twice in that range, so I focused on the go 1.19 portion, which was bumped on September 9. This Pebble bump commit had by far the biggest impact; running benchdiff immediately before and after it accounted for 69% of the allocs/op regression and 21% of the alloc/op regression. A 22% time/op regression remained unaccounted for, so I bisected further between the 23.1 commit and the Pebble bump commit with a threshold of 0.1. This yielded another Pebble bump commit, which accounted for 10% of the time/op regression.

This doesn't paint the full picture, and I certainly haven't tested the other subtests of RefreshRange, but it's a good starting point. Nothing in the Pebble commits looks particularly suspicious to me, but there are quite a few commits in each Pebble version bump and I don't have good context here.
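
For reference, narrowing the search like this is just the auto-bisect recipe from the issue description re-run over the smaller commit range with a tighter threshold. A minimal sketch, assuming the RefreshRange benchmark lives in pkg/kv/kvserver/batcheval and using placeholder shas rather than the actual commits referenced above:

```
# ${pebble_bump_sha} is a placeholder for the suspect Pebble bump commit;
# ${base_sha} is the 23.1 base from the setup steps in the issue description.
export pkg_name='./pkg/kv/kvserver/batcheval'
export test_name='RefreshRange/linear-keys'
export threshold=0.1

rm -rf ./benchdiff/*
git bisect reset
git bisect start ${pebble_bump_sha} ${base_sha}
git bisect run benchdiff --old=${base_sha} --post-checkout='bash post-checkout.sh' \
    --run=${test_name} --count=10 --threshold=${threshold} ${pkg_name}
```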

arulajmani commented 1 year ago

After a fair bit of fighting with Go versions, I bisected TruncateLoop down to https://github.com/cockroachdb/cockroach/pull/107658. The regression comes from the extra field we added to the Header proto, which is intended.

arulajmani commented 1 year ago

I tracked down the majority of the regression we're seeing in OptimisticEval to https://github.com/cockroachdb/cockroach/commit/8e72fe0d4d018819ffaeabd3df201fa19d01d733, which was already fixed in https://github.com/cockroachdb/cockroach/pull/114437. There's still a minor regression ( ~ 1% heap allocs) between [release-23.1, 8e72fe0d4d018819ffaeabd3df201fa19d01d733). I'll run a bisect on that portion to see if something shakes out.

pav-kv commented 1 year ago

Team, heads up on fix #115147 - the bug could have impacted some microbenchmarks, particularly those in which proposals are sent concurrently.

nvanbenschoten commented 1 year ago

I've extracted the regression in RefreshRange into https://github.com/cockroachdb/cockroach/issues/115215 and RaftAdmissionMetaOverhead into https://github.com/cockroachdb/cockroach/issues/115216.

kvoli commented 1 year ago

Comparing 716508be0f7672c50fbec5deeacefdbe44df3ea0 and 6b9599d94f5c738944226f4adc7784415fdb31b8, there's no regression for MVCCGCWithForegroundTraffic.

```
name                                                      old time/op    new time/op    delta
MVCCGCWithForegroundTraffic/noop_gc_with_reads-24           12.8µs ± 4%    12.4µs ± 2%  -2.85%  (p=0.006 n=10+9)
MVCCGCWithForegroundTraffic/gc_with_reads_and_writes-24     27.8µs ± 7%    28.2µs ± 6%    ~     (p=0.436 n=10+10)
MVCCGCWithForegroundTraffic/gc_with_reads-24                31.1µs ± 7%    30.8µs ± 5%    ~     (p=0.684 n=10+10)

name                                                      old alloc/op   new alloc/op   delta
MVCCGCWithForegroundTraffic/noop_gc_with_reads-24           1.34kB ± 1%    1.33kB ± 0%  -0.81%  (p=0.004 n=10+9)
MVCCGCWithForegroundTraffic/gc_with_reads_and_writes-24     1.78kB ± 1%    1.78kB ± 1%    ~     (p=0.493 n=10+10)
MVCCGCWithForegroundTraffic/gc_with_reads-24                1.52kB ± 2%    1.51kB ± 1%    ~     (p=0.367 n=10+9)

name                                                      old allocs/op  new allocs/op  delta
MVCCGCWithForegroundTraffic/noop_gc_with_reads-24             19.6 ± 3%      19.0 ± 0%  -3.06%  (p=0.006 n=10+9)
MVCCGCWithForegroundTraffic/gc_with_reads_and_writes-24       25.0 ± 0%      25.0 ± 0%    ~     (all equal)
MVCCGCWithForegroundTraffic/gc_with_reads-24                  21.0 ± 0%      21.0 ± 0%    ~     (all equal)
```

The regression between 23.1 (f13c021b) and 23.2 (6d4a8274):

```
name                                                      old time/op    new time/op    delta
MVCCGCWithForegroundTraffic/gc_with_reads-24                29.5µs ± 8%    30.8µs ± 6%     ~      (p=0.075 n=10+10)
MVCCGCWithForegroundTraffic/noop_gc_with_reads-24           12.0µs ± 1%    12.7µs ± 3%   +6.36%   (p=0.000 n=7+10)
MVCCGCWithForegroundTraffic/gc_with_reads_and_writes-24     26.6µs ± 5%    28.5µs ± 6%   +7.39%   (p=0.000 n=10+10)

name                                                      old alloc/op   new alloc/op   delta
MVCCGCWithForegroundTraffic/gc_with_reads-24                1.83kB ±15%    1.49kB ± 2%  -18.66%   (p=0.000 n=10+8)
MVCCGCWithForegroundTraffic/gc_with_reads_and_writes-24     2.10kB ±10%    1.71kB ± 1%  -18.42%   (p=0.000 n=10+10)
MVCCGCWithForegroundTraffic/noop_gc_with_reads-24           1.61kB ± 2%    1.32kB ± 1%  -18.11%   (p=0.000 n=10+9)

name                                                      old allocs/op  new allocs/op  delta
MVCCGCWithForegroundTraffic/noop_gc_with_reads-24             22.0 ± 0%      18.0 ± 0%  -18.18%   (p=0.000 n=10+10)
MVCCGCWithForegroundTraffic/gc_with_reads-24                  23.0 ± 0%      20.0 ± 0%  -13.04%   (p=0.000 n=10+10)
MVCCGCWithForegroundTraffic/gc_with_reads_and_writes-24       26.2 ± 5%      23.0 ± 0%  -12.21%   (p=0.000 n=10+10)
```

I saw that the regression shows up prior to the go bump, so I'll start bisecting there.

kvoli commented 1 year ago

Bisecting MVCCGCWithForegroundTraffic with a 0.05 threshold, between 5f4fe6e2d9c1c40e74ba9f0248063a5f5a3ec4d7 and c0f0059ca529a18d20736d83013cc46f21c1ae67, pointed to 199b177e4e835a668ca339e485baf1ba6e7ad69f, which appears unrelated.

```
git bisect log
# bad: [c0f0059ca529a18d20736d83013cc46f21c1ae67] Merge #110083
# good: [5f4fe6e2d9c1c40e74ba9f0248063a5f5a3ec4d7] roachtest: set 30m timeout for all disk stall roachtests
git bisect start 'c0f0059ca529a18d20736d83013cc46f21c1ae67' '5f4fe6e2d9c1c40e74ba9f0248063a5f5a3ec4d7'
# bad: [462fd410f95d0f06129bdda84b344d9b95441cc4] Merge #105015
git bisect bad 462fd410f95d0f06129bdda84b344d9b95441cc4
# bad: [50a0afce0fcd306a22051345e9f3a8c6d93e5928] Merge #100432 #102416
git bisect bad 50a0afce0fcd306a22051345e9f3a8c6d93e5928
# good: [e640de72f80af91beeff2256e8fb25c966c2b845] Merge #99423 #99987 #100565
git bisect good e640de72f80af91beeff2256e8fb25c966c2b845
# bad: [f24b893e5d825bc5fd7b91858426063590e7bdef] Merge #100270
git bisect bad f24b893e5d825bc5fd7b91858426063590e7bdef
# bad: [61946783bc69e4fe9dab0f7702e7a0ae550af84d] Merge #100925 #100993
git bisect bad 61946783bc69e4fe9dab0f7702e7a0ae550af84d
# good: [7d219406b9c0fc5079403840d55ba40f721512f9] Merge #99958 #100726 #100741 #100778 #100821
git bisect good 7d219406b9c0fc5079403840d55ba40f721512f9
# bad: [570ec001cb66c434a9f382bffca9e4d167dac85b] Merge #100773 #100842 #100898
git bisect bad 570ec001cb66c434a9f382bffca9e4d167dac85b
# good: [0f2db80943ae1c8977527aeb89be9e7fc0bd97b2] pgwire: update comments for `limitedCommandResult`
git bisect good 0f2db80943ae1c8977527aeb89be9e7fc0bd97b2
# bad: [22ab7edf761e16c09b70caeecb27bd332bfe88ee] Merge #99663 #99947 #100188 #100620
git bisect bad 22ab7edf761e16c09b70caeecb27bd332bfe88ee
# bad: [199b177e4e835a668ca339e485baf1ba6e7ad69f] sql: disable multiple active portals for TestDistSQLReceiverDrainsMeta.
git bisect bad 199b177e4e835a668ca339e485baf1ba6e7ad69f
# good: [dcbcca9468d0e25dd790ea880d6bb3dacb82f36c] sql: store retErr and retPayload for pausable portals
git bisect good dcbcca9468d0e25dd790ea880d6bb3dacb82f36c
# good: [3d20ce5a9729ac189ea252ab9e6682425b10e65f] sql: correctly set inner plan for leafTxn when resuming flow for portal
git bisect good 3d20ce5a9729ac189ea252ab9e6682425b10e65f
# good: [789c2cdc27f169ea2d22552dd9906b523b31f126] pgwire: add tests for multiple active portals
git bisect good 789c2cdc27f169ea2d22552dd9906b523b31f126
# first bad commit: [199b177e4e835a668ca339e485baf1ba6e7ad69f] sql: disable multiple active portals for TestDistSQLReceiverDrainsMeta.
```

Diffed profile base_5f4fe6e_new_c0f0059_diff_cpu.pb.gz
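
For anyone reproducing this, one plausible way to produce and inspect a diff like that (a sketch only; the exact commands behind the attached profile aren't recorded here, and it assumes a checkout where plain `go test` works after `./dev generate`):

```
# Collect a CPU profile of the benchmark at each sha, then diff the two
# profiles with pprof. The profile file names are placeholders, and
# `go test -cpuprofile` writes its output into the package directory.
git checkout 5f4fe6e2d9c1c40e74ba9f0248063a5f5a3ec4d7 && ./dev generate
go test ./pkg/kv/kvserver -run='^$' -bench=MVCCGCWithForegroundTraffic -count=10 -cpuprofile=base_cpu.pb.gz
git checkout c0f0059ca529a18d20736d83013cc46f21c1ae67 && ./dev generate
go test ./pkg/kv/kvserver -run='^$' -bench=MVCCGCWithForegroundTraffic -count=10 -cpuprofile=new_cpu.pb.gz
go tool pprof -diff_base=pkg/kv/kvserver/base_cpu.pb.gz pkg/kv/kvserver/new_cpu.pb.gz
```

The same `-diff_base` workflow applies to heap profiles collected with `-memprofile`, which is useful when the regression shows up in allocs/op rather than time/op.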

arulajmani commented 1 year ago

ReturnOnRangeBoundary has a ~3% regression between the first commit after the bump to go1.20 and the last commit before the bump to go1.21. I bisected between these two commits twice, once with a threshold of 2% and once with a threshold of 1%. The first bisection pointed to e8f27c3506cc93ac6f985dab16add8ebf7d3e5fa and the second pointed to 660e93e111cf26a0bece476207ea2c6c91fbb31e.

These two commits look unrelated. Moreover, comparing each of them to its preceding commit shows no difference. I'm inclined to conclude that this benchmark slowly degraded over time, as opposed to there being a single offending commit.

Next up, I'll look into the specifics of the benchmark and maybe compare heap profiles between the commits that show a 3% change. @nvanbenschoten, let me know if there are any other threads that we could pull on here.
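
A way to sanity-check the gradual-degradation theory (a sketch only, with placeholder shas; it simply reuses the benchdiff invocation from the issue description) is to compare a handful of evenly spaced commits against the same base and watch the delta creep up:

```
# sha_a..sha_d are placeholders for a few commits spaced evenly across the
# suspect range [go1.20 bump, go1.21 bump).
for sha in sha_a sha_b sha_c sha_d; do
  benchdiff --new=${sha} --old=${base_sha} --post-checkout='bash post-checkout.sh' \
      --run=ReturnOnRangeBoundary --count=10 ./pkg/kv/kvclient/kvcoord
done
```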

nvanbenschoten commented 11 months ago

> ReturnOnRangeBoundary has a ~3% regression between the first commit after the bump to go1.20 and the last commit before the bump to go1.21.

The spreadsheet indicates that ReturnOnRangeBoundary regressed by 28.45% between f13c021b and 6d4a8274. Are we not able to reproduce that? The regression is on the order of 90µs (310µs -> 400µs), which is of a large enough magnitude that it's likely more than just noise.

arulajmani commented 11 months ago

> Are we not able to reproduce that?

My bad, I missed that line in the spreadsheet. I hadn't run the benchmark between [base_sha, go_19_last_sha] before -- doing so, I see:

benchdiff --new=${go_1_19_last_sha} --old=${base_sha} --post-checkout='bash post-checkout.sh' --run=${test_name} --count=10 ${pkg_name}
....
  pkg=1/1 iter=10/10 cockroachdb/cockroach/pkg/kv/kvclient/kvcoord -

name                      old time/op    new time/op    delta
ReturnOnRangeBoundary-24     417µs ± 2%     965µs ±12%  +131.36%  (p=0.000 n=10+10)

name                      old alloc/op   new alloc/op   delta
ReturnOnRangeBoundary-24    93.8kB ± 1%   123.7kB ± 5%   +31.85%  (p=0.000 n=10+10)

name                      old allocs/op  new allocs/op  delta
ReturnOnRangeBoundary-24       762 ± 0%      1133 ± 5%   +48.71%  (p=0.000 n=10+10)

Running a bisect now.

arulajmani commented 11 months ago

Bisected a big chunk of the regression to 3502ca2ef2dffd9f19c3bd7d7277324433ac43ad. Running benchdiff between this commit and the prior one:

  pkg=1/1 iter=10/10 cockroachdb/cockroach/pkg/kv/kvclient/kvcoord -

name                      old time/op    new time/op    delta
ReturnOnRangeBoundary-24     514µs ±13%     767µs ±10%  +49.15%  (p=0.000 n=10+10)

name                      old alloc/op   new alloc/op   delta
ReturnOnRangeBoundary-24    96.9kB ± 4%   111.7kB ± 3%  +15.26%  (p=0.000 n=10+10)

name                      old allocs/op  new allocs/op  delta
ReturnOnRangeBoundary-24       799 ± 4%       984 ± 7%  +23.09%  (p=0.000 n=10+10)

That doesn't fully explain the regression we saw above. I'll run another two bisects, one on each half: [base_sha, 3502ca2ef2dffd9f19c3bd7d7277324433ac43ad] and [3502ca2ef2dffd9f19c3bd7d7277324433ac43ad, go_19_last_sha].

arulajmani commented 11 months ago

@kvoli mentioned that he landed https://github.com/cockroachdb/cockroach/pull/113229, which resolved some cpu/memory regressions introduced by https://github.com/cockroachdb/cockroach/commit/3502ca2ef2dffd9f19c3bd7d7277324433ac43ad. That checks out with what I saw while bisecting as well -- the benchmark improved significantly between [go_1_21_sha, master].

@nvanbenschoten is taking over the rest of the investigation (thanks!).

nvanbenschoten commented 11 months ago

ReturnOnRangeBoundary was tricky to pin down because its performance regressed and later recovered a few times during the release. I believe I was able to find the regression which never recovered. Interestingly, the cause is the same as the BumpSideTransportClosed regression. We pulled that out into https://github.com/cockroachdb/cockroach/issues/115058, but we haven't yet bottomed out on why that commit has this effect. I've re-opened that issue to track this regression as well.

We're down to zero remaining regressions: every regression has either been resolved or identified and extracted into a separate issue. Nice teamwork!