I bisected the RefreshRange/linear-keys regression since it has the largest regression (75% for allocs/op). I couldn't automatically bisect over the entire range of commits because we bumped the go version twice in that range, so I focused on go 1.19, which got bumped on September 9. This Pebble bump commit had by far the biggest impact; running benchdiff before and after the commit accounted for 69% of the allocs/op regression and 21% of the alloc/op regression. There was still a 22% time/op regression that this didn't account for, so I bisected some more between the 23.1 commit and this Pebble bump commit with a threshold of 0.1. This yielded another Pebble bump commit, which accounted for 10% of the time/op regression.

This doesn't paint the full picture, and I certainly haven't tested the other subtests of RefreshRange, but it's a good starting point. Nothing in the Pebble commits looks particularly suspicious to me, but there are quite a few commits in each Pebble version bump and I don't have good context here.
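For anyone reproducing the before/after comparison, here's a minimal benchdiff sketch against a suspect commit and its parent. The SHA is a placeholder, the package path comes from the triage list below, the flags mirror the invocation used later in this thread, and it assumes benchdiff accepts arbitrary git revisions like ${sha}~1.

```
# <pebble-bump-sha> is a placeholder for the suspect Pebble bump commit.
suspect_sha=<pebble-bump-sha>

# Compare the suspect commit against its parent for the regressing benchmark.
benchdiff \
  --old="${suspect_sha}~1" \
  --new="${suspect_sha}" \
  --run='RefreshRange/linear-keys' \
  --count=10 \
  ./pkg/kv/kvserver/batcheval
```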
After a fair bit of golang version fighting, I bisected TruncateLoop down to https://github.com/cockroachdb/cockroach/pull/107658. The regression results from the extra field we added to the Header proto, which is intended.
I tracked down the majority of the regression we're seeing in OptimisticEval to https://github.com/cockroachdb/cockroach/commit/8e72fe0d4d018819ffaeabd3df201fa19d01d733, which was already fixed in https://github.com/cockroachdb/cockroach/pull/114437. There's still a minor regression (~1% heap allocs) between [release-23.1, 8e72fe0d4d018819ffaeabd3df201fa19d01d733). I'll run a bisect on that portion to see if something shakes out.
Team, heads up on fix #115147 - the bug could have impacted some microbenchmarks, particularly those in which proposals are sent concurrently.
I've extracted the regression in RefreshRange into https://github.com/cockroachdb/cockroach/issues/115215 and RaftAdmissionMetaOverhead into https://github.com/cockroachdb/cockroach/issues/115216.
Comparing 716508be0f7672c50fbec5deeacefdbe44df3ea0 and 6b9599d94f5c738944226f4adc7784415fdb31b8, there's no regression for MVCCGCWithForegroundTraffic.
```
name                                                     old time/op    new time/op    delta
MVCCGCWithForegroundTraffic/noop_gc_with_reads-24          12.8µs ± 4%    12.4µs ± 2%  -2.85%  (p=0.006 n=10+9)
MVCCGCWithForegroundTraffic/gc_with_reads_and_writes-24    27.8µs ± 7%    28.2µs ± 6%    ~     (p=0.436 n=10+10)
MVCCGCWithForegroundTraffic/gc_with_reads-24               31.1µs ± 7%    30.8µs ± 5%    ~     (p=0.684 n=10+10)

name                                                     old alloc/op   new alloc/op   delta
MVCCGCWithForegroundTraffic/noop_gc_with_reads-24          1.34kB ± 1%    1.33kB ± 0%  -0.81%  (p=0.004 n=10+9)
MVCCGCWithForegroundTraffic/gc_with_reads_and_writes-24    1.78kB ± 1%    1.78kB ± 1%    ~     (p=0.493 n=10+10)
MVCCGCWithForegroundTraffic/gc_with_reads-24               1.52kB ± 2%    1.51kB ± 1%    ~     (p=0.367 n=10+9)

name                                                     old allocs/op  new allocs/op  delta
MVCCGCWithForegroundTraffic/noop_gc_with_reads-24            19.6 ± 3%      19.0 ± 0%  -3.06%  (p=0.006 n=10+9)
MVCCGCWithForegroundTraffic/gc_with_reads_and_writes-24      25.0 ± 0%      25.0 ± 0%    ~     (all equal)
MVCCGCWithForegroundTraffic/gc_with_reads-24                 21.0 ± 0%      21.0 ± 0%    ~     (all equal)
```
The regression between 23.1 (f13c021b) and 23.2 (6d4a8274):

```
name                                                     old time/op    new time/op    delta
MVCCGCWithForegroundTraffic/gc_with_reads-24               29.5µs ± 8%    30.8µs ± 6%     ~      (p=0.075 n=10+10)
MVCCGCWithForegroundTraffic/noop_gc_with_reads-24          12.0µs ± 1%    12.7µs ± 3%   +6.36%   (p=0.000 n=7+10)
MVCCGCWithForegroundTraffic/gc_with_reads_and_writes-24    26.6µs ± 5%    28.5µs ± 6%   +7.39%   (p=0.000 n=10+10)

name                                                     old alloc/op   new alloc/op   delta
MVCCGCWithForegroundTraffic/gc_with_reads-24               1.83kB ±15%    1.49kB ± 2%  -18.66%   (p=0.000 n=10+8)
MVCCGCWithForegroundTraffic/gc_with_reads_and_writes-24    2.10kB ±10%    1.71kB ± 1%  -18.42%   (p=0.000 n=10+10)
MVCCGCWithForegroundTraffic/noop_gc_with_reads-24          1.61kB ± 2%    1.32kB ± 1%  -18.11%   (p=0.000 n=10+9)

name                                                     old allocs/op  new allocs/op  delta
MVCCGCWithForegroundTraffic/noop_gc_with_reads-24            22.0 ± 0%      18.0 ± 0%  -18.18%   (p=0.000 n=10+10)
MVCCGCWithForegroundTraffic/gc_with_reads-24                 23.0 ± 0%      20.0 ± 0%  -13.04%   (p=0.000 n=10+10)
MVCCGCWithForegroundTraffic/gc_with_reads_and_writes-24      26.2 ± 5%      23.0 ± 0%  -12.21%   (p=0.000 n=10+10)
```
I saw that the regression shows up prior to the go bump, so I'll start bisecting there.
Bisecting MVCCGCWithForegroundTraffic with a 0.05 threshold pointed to 199b177e4e835a668ca339e485baf1ba6e7ad69f, which appears unrelated. I bisected between 5f4fe6e2d9c1c40e74ba9f0248063a5f5a3ec4d7 and c0f0059ca529a18d20736d83013cc46f21c1ae67.
```
git bisect log
# bad: [c0f0059ca529a18d20736d83013cc46f21c1ae67] Merge #110083
# good: [5f4fe6e2d9c1c40e74ba9f0248063a5f5a3ec4d7] roachtest: set 30m timeout for all disk stall roachtests
git bisect start 'c0f0059ca529a18d20736d83013cc46f21c1ae67' '5f4fe6e2d9c1c40e74ba9f0248063a5f5a3ec4d7'
# bad: [462fd410f95d0f06129bdda84b344d9b95441cc4] Merge #105015
git bisect bad 462fd410f95d0f06129bdda84b344d9b95441cc4
# bad: [50a0afce0fcd306a22051345e9f3a8c6d93e5928] Merge #100432 #102416
git bisect bad 50a0afce0fcd306a22051345e9f3a8c6d93e5928
# good: [e640de72f80af91beeff2256e8fb25c966c2b845] Merge #99423 #99987 #100565
git bisect good e640de72f80af91beeff2256e8fb25c966c2b845
# bad: [f24b893e5d825bc5fd7b91858426063590e7bdef] Merge #100270
git bisect bad f24b893e5d825bc5fd7b91858426063590e7bdef
# bad: [61946783bc69e4fe9dab0f7702e7a0ae550af84d] Merge #100925 #100993
git bisect bad 61946783bc69e4fe9dab0f7702e7a0ae550af84d
# good: [7d219406b9c0fc5079403840d55ba40f721512f9] Merge #99958 #100726 #100741 #100778 #100821
git bisect good 7d219406b9c0fc5079403840d55ba40f721512f9
# bad: [570ec001cb66c434a9f382bffca9e4d167dac85b] Merge #100773 #100842 #100898
git bisect bad 570ec001cb66c434a9f382bffca9e4d167dac85b
# good: [0f2db80943ae1c8977527aeb89be9e7fc0bd97b2] pgwire: update comments for `limitedCommandResult`
git bisect good 0f2db80943ae1c8977527aeb89be9e7fc0bd97b2
# bad: [22ab7edf761e16c09b70caeecb27bd332bfe88ee] Merge #99663 #99947 #100188 #100620
git bisect bad 22ab7edf761e16c09b70caeecb27bd332bfe88ee
# bad: [199b177e4e835a668ca339e485baf1ba6e7ad69f] sql: disable multiple active portals for TestDistSQLReceiverDrainsMeta.
git bisect bad 199b177e4e835a668ca339e485baf1ba6e7ad69f
# good: [dcbcca9468d0e25dd790ea880d6bb3dacb82f36c] sql: store retErr and retPayload for pausable portals
git bisect good dcbcca9468d0e25dd790ea880d6bb3dacb82f36c
# good: [3d20ce5a9729ac189ea252ab9e6682425b10e65f] sql: correctly set inner plan for leafTxn when resuming flow for portal
git bisect good 3d20ce5a9729ac189ea252ab9e6682425b10e65f
# good: [789c2cdc27f169ea2d22552dd9906b523b31f126] pgwire: add tests for multiple active portals
git bisect good 789c2cdc27f169ea2d22552dd9906b523b31f126
# first bad commit: [199b177e4e835a668ca339e485baf1ba6e7ad69f] sql: disable multiple active portals for TestDistSQLReceiverDrainsMeta.
```
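For context on the threshold-driven bisect, here is a minimal sketch of how `git bisect run` can automate it with a small helper script (called check.sh here for illustration). The 27.5µs cutoff is a hypothetical value sitting between the good (~26.6µs) and bad (~28.5µs) timings above, and the sketch assumes the package builds and benchmarks with plain `go test` at each commit.

```
#!/usr/bin/env bash
# check.sh: classify the current commit for `git bisect run`.
# Exit 0 (good) if the benchmark averages under the cutoff, 1 (bad) otherwise,
# and 125 to skip commits that don't build.
set -euo pipefail

out=$(go test ./pkg/kv/kvserver -run='^$' \
        -bench='MVCCGCWithForegroundTraffic/gc_with_reads_and_writes' \
        -benchtime=1000x -count=5 2>&1) || exit 125

# Average the ns/op column across the benchmark runs.
avg=$(echo "$out" | awk '/MVCCGCWithForegroundTraffic/ {sum += $3; n++} END {print (n ? sum / n : 1e18)}')

# Hypothetical cutoff of 27500 ns/op between the known-good and known-bad timings.
awk -v avg="$avg" 'BEGIN { exit (avg < 27500) ? 0 : 1 }'
```

With that in place, `git bisect start <bad-sha> <good-sha>` followed by `git bisect run ./check.sh` walks the range automatically, and the transcript above is what `git bisect log` prints afterwards.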
Diffed profile base_5f4fe6e_new_c0f0059_diff_cpu.pb.gz
ReturnOnRangeBoundary has a ~3% regression between the first commit after the bump to go1.20 and the last commit before the bump to go1.21. I bisected between these two commits twice, once with a threshold of 2% and once with a threshold of 1%. The first bisection pointed to e8f27c3506cc93ac6f985dab16add8ebf7d3e5fa and the second pointed to 660e93e111cf26a0bece476207ea2c6c91fbb31e.

These two commits look unrelated. Moreover, comparing each of them to its preceding commit shows no difference. I'm inclined to conclude that this benchmark slowly degraded over time, as opposed to there being a single offending commit.
Next up, I'll look into the specifics of the benchmark and maybe compare heap profiles between the commits that show a 3% change. @nvanbenschoten, let me know if there are any other threads that we could pull on here.
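As a starting point for the heap-profile comparison, here's a hedged sketch assuming the benchmark runs under plain `go test`; the checkout targets are placeholders for the two commits that bracket the ~3% change.

```
# Profile allocations at the older commit (placeholder ref).
git checkout <first-commit-after-go1.20-bump>
go test ./pkg/kv/kvclient/kvcoord -run='^$' -bench='ReturnOnRangeBoundary' \
  -count=5 -memprofile=old_mem.pb.gz

# Profile allocations at the newer commit (placeholder ref).
git checkout <last-commit-before-go1.21-bump>
go test ./pkg/kv/kvclient/kvcoord -run='^$' -bench='ReturnOnRangeBoundary' \
  -count=5 -memprofile=new_mem.pb.gz

# Show only the allocation sites that changed between the two profiles.
go tool pprof -top -diff_base=old_mem.pb.gz new_mem.pb.gz
```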
> ReturnOnRangeBoundary has a ~3% regression between the first commit after the bump to go1.20 and the last commit before the bump to go1.21.

The spreadsheet indicates that ReturnOnRangeBoundary regressed by 28.45% between f13c021b and 6d4a8274. Are we not able to reproduce that? The regression is on the order of 90µs (310µs -> 400µs), which is of a large enough magnitude that it's likely more than just noise.
> Are we not able to reproduce that?

My bad, I missed that line in the spreadsheet. I hadn't run the benchmark between [base_sha, go_19_last_sha] before -- doing so, I see:
```
benchdiff --new=${go_1_19_last_sha} --old=${base_sha} --post-checkout='bash post-checkout.sh' --run=${test_name} --count=10 ${pkg_name}
....
pkg=1/1 iter=10/10 cockroachdb/cockroach/pkg/kv/kvclient/kvcoord -

name                      old time/op    new time/op    delta
ReturnOnRangeBoundary-24     417µs ± 2%     965µs ±12%  +131.36%  (p=0.000 n=10+10)

name                      old alloc/op   new alloc/op   delta
ReturnOnRangeBoundary-24    93.8kB ± 1%   123.7kB ± 5%   +31.85%  (p=0.000 n=10+10)

name                      old allocs/op  new allocs/op  delta
ReturnOnRangeBoundary-24       762 ± 0%      1133 ± 5%   +48.71%  (p=0.000 n=10+10)
```
Running a bisect now.
Bisected a big chunk to 3502ca2ef2dffd9f19c3bd7d7277324433ac43ad. Running benchdiff between this and the prior commit:
```
pkg=1/1 iter=10/10 cockroachdb/cockroach/pkg/kv/kvclient/kvcoord -

name                      old time/op    new time/op    delta
ReturnOnRangeBoundary-24     514µs ±13%     767µs ±10%   +49.15%  (p=0.000 n=10+10)

name                      old alloc/op   new alloc/op   delta
ReturnOnRangeBoundary-24    96.9kB ± 4%   111.7kB ± 3%   +15.26%  (p=0.000 n=10+10)

name                      old allocs/op  new allocs/op  delta
ReturnOnRangeBoundary-24       799 ± 4%       984 ± 7%   +23.09%  (p=0.000 n=10+10)
```
That doesn't fully explain the regression we saw above. I'll run two more bisects, one on each half of [base_sha, 3502ca2ef2dffd9f19c3bd7d7277324433ac43ad] and [3502ca2ef2dffd9f19c3bd7d7277324433ac43ad, go_19_last_sha].
@kvoli mentioned that he landed https://github.com/cockroachdb/cockroach/pull/113229, which resolved some cpu/memory regressions introduced by https://github.com/cockroachdb/cockroach/commit/3502ca2ef2dffd9f19c3bd7d7277324433ac43ad. That checks out with what I saw while bisecting this as well -- I saw the benchmark improve significantly between [go_1_21_sha, master].
@nvanbenschoten is taking over the rest of the investigation (thanks!).
ReturnOnRangeBoundary was tricky to pin down because its performance regressed and later recovered a few times during the release. I believe I was able to find the one regression that never recovered. Interestingly, the cause is the same as the BumpSideTransportClosed regression. We pulled that out into https://github.com/cockroachdb/cockroach/issues/115058, but we haven't yet bottomed out on why that commit has this effect. I've re-opened that issue to track this regression as well.
Every remaining regression has now either been resolved or been identified and extracted into a separate issue. Nice teamwork!
From the 23.2 microbenchmark results: https://docs.google.com/spreadsheets/d/1l7yX9btMitDyZ1DLJQDaJVNoukpYbk2iu1tgTg1551Q/edit#gid=5
All relevant performance regressions are (at least) reflected in the sec/op metric. Some are also reflected in heap allocation metrics, which may explain the slowdown. I've done my best to group related regressions to avoid redundant work.
pkg/kv/kvclient/kvcoord
- ReturnOnRangeBoundary @arulajmani @nvanbenschoten
- TruncateLoop (and also TruncateLegacy) @arulajmani

pkg/kv/kvserver
- BumpSideTransportClosed @arulajmani (https://github.com/cockroachdb/cockroach/issues/115058)
- MVCCGCWithForegroundTraffic @kvoli
- OptimisticEval (and also OptimisticEvalForLocks) @arulajmani
- ReplicaProposal

pkg/kv/kvserver/batcheval
- RefreshRange @miraradeva

pkg/kv/kvserver/concurrency
- LockTable @arulajmani

pkg/kv/kvserver/spanlatch
- LatchManagerReadWriteMix @arulajmani

pkg/kv/kvserver/logstore
- LogStore_StoreEntries @kvoli

pkg/kv/kvserver/raftentry
- EntryCacheClearTo @andrewbaptist

pkg/kv/kvserver/raftlog
- RaftAdmissionMetaOverhead @nvanbenschoten

pkg/kv/kvserver/tscache
- IntervalSklAdd @nvanbenschoten
- IntervalSklDecodeValue @nvanbenschoten
- TimestampCacheInsertion @nvanbenschoten

pkg/kv/kvserver/concurrency and pkg/kv/kvserver/spanlatch
These packages are seeing regressions across a collection of btree benchmarks. If this is not noise (it probably is), they likely have the same cause.
- BTreeDelete, BTreeDeleteInsert, BTreeInsert, etc. @nvanbenschoten
- BTreeIterOverlapScan @nvanbenschoten

Tips for debugging
Benchdiff can be a useful tool for debugging. Install the tool, then follow these steps to auto-bisect the regression.
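A minimal sketch of that setup; the module path for the install is my assumption (adjust to wherever the benchdiff tool is actually hosted), and the SHAs are placeholders.

```
# Install benchdiff (module path is an assumption).
go install github.com/nvanbenschoten/benchdiff@latest

# Compare one benchmark between a known-good and a suspect SHA, mirroring the
# invocation used earlier in this thread.
benchdiff --old=<good-sha> --new=<suspect-sha> \
  --run='ReturnOnRangeBoundary' --count=10 \
  ./pkg/kv/kvclient/kvcoord
```

For the bisect itself, the `git bisect run` sketch earlier in the thread can drive the same benchmark at each candidate commit.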
Jira issue: CRDB-33448