cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kv95/enc=false/nodes=3/cpu=96 regression on July 27 #70357

Closed stevendanna closed 1 year ago

stevendanna commented 3 years ago

Describe the problem

The performance of kv95/enc=false/nodes=3/cpu=96 regressed around 27 July, from ~120k to ~110k requests per second.

https://roachperf.crdb.dev/?filter=cpu%3D96&view=kv95%2Fenc%3Dfalse%2Fnodes%3D3%2Fcpu%3D96&tab=gce

While this aligns with the introduction of the roachprod environment variable bug, it did not recover after that bug was fixed.

Testing with a cockroach binary built from 9baaa282b3 suggests that a similar performance drop can be reproduced just by varying the workload, roachtest, and roachprod binaries.

For example, using all binaries from 9baaa282b3:

results.9baaa282b3-old-workload/kv95/enc=false/nodes=3/cpu=96/run_1/run_124133.778339000_n4_workload_run_kv.log:  600.0s        0       70423673       117372.6      1.6      1.1      6.3     12.6    104.9  
results.9baaa282b3-old-workload/kv95/enc=false/nodes=3/cpu=96/run_2/run_124154.426706000_n4_workload_run_kv.log:  600.0s        0       70501559       117502.4      1.6      1.0      6.3     12.6    109.1  
results.9baaa282b3-old-workload/kv95/enc=false/nodes=3/cpu=96/run_3/run_124145.024887000_n4_workload_run_kv.log:  600.0s        0       70195156       116991.7      1.6      1.1      6.6     12.6    838.9 

But using cockroach from 9baaa282b3 with workload, roachtest, and roachprod from master:

artifacts/kv95/enc=false/nodes=3/cpu=96/run_1/run_140156.218649000_n4_workload_run_kv.log:  600.0s        0       67906942       113177.9      1.7      1.1      6.3     12.6     96.5  
artifacts/kv95/enc=false/nodes=3/cpu=96/run_2/run_140157.359383000_n4_workload_run_kv.log:  600.0s        0       67302108       112169.7      1.7      1.1      6.3     12.1    419.4  
artifacts/kv95/enc=false/nodes=3/cpu=96/run_3/run_140204.484130000_n4_workload_run_kv.log:  600.0s        0       67124731       111874.2      1.7      1.1      6.6     12.6    109.1 
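
To put a number on the gap between the two sets of runs, here is a minimal sketch that averages the throughput column of the summary lines above. It assumes the fourth numeric column is cumulative ops/sec (which is consistent with total ops divided by the 600 s elapsed time). The helper name `ops_per_sec` and the shortened `run_N` labels are illustrative only; the numeric columns are copied unchanged.

```python
from statistics import mean

# Summary lines from the runs above. Only the log path prefix is shortened
# to a "run_N" label; every numeric column is copied as-is.
ALL_FROM_9BAAA282B3 = """\
run_1:  600.0s        0       70423673       117372.6      1.6      1.1      6.3     12.6    104.9
run_2:  600.0s        0       70501559       117502.4      1.6      1.0      6.3     12.6    109.1
run_3:  600.0s        0       70195156       116991.7      1.6      1.1      6.6     12.6    838.9
"""

TOOLS_FROM_MASTER = """\
run_1:  600.0s        0       67906942       113177.9      1.7      1.1      6.3     12.6     96.5
run_2:  600.0s        0       67302108       112169.7      1.7      1.1      6.3     12.1    419.4
run_3:  600.0s        0       67124731       111874.2      1.7      1.1      6.6     12.6    109.1
"""


def ops_per_sec(summary_line: str) -> float:
    """Return the throughput column of one summary line.

    Assumed column order after the "label:" prefix:
    elapsed, errors, total ops, ops/sec, then latency columns.
    """
    fields = summary_line.rsplit(":", 1)[1].split()
    return float(fields[3])


def mean_throughput(block: str) -> float:
    """Average the throughput column over all non-empty lines in a block."""
    return mean(ops_per_sec(line) for line in block.splitlines() if line.strip())


old_mean = mean_throughput(ALL_FROM_9BAAA282B3)
new_mean = mean_throughput(TOOLS_FROM_MASTER)
drop = old_mean - new_mean
drop_pct = 100.0 * drop / old_mean

print(f"all binaries from 9baaa282b3: {old_mean:.1f} ops/s")
print(f"tools from master:            {new_mean:.1f} ops/s")
print(f"drop:                         {drop:.1f} ops/s ({drop_pct:.2f}%)")
```

Under that reading of the columns, the three-run means are roughly 117.3k and 112.4k ops/s, a gap of about 4.9k ops/s (around 4%). With only three runs per configuration, this is a rough comparison rather than a statistically solid one.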

Jira issue: CRDB-10044

stevendanna commented 3 years ago

Bisecting just the workload binary suggests that

https://github.com/cockroachdb/cockroach/pull/68608

accounts for about 4k to 6k of the drop. That PR merged the day after the roachprod fix.

ajwerner commented 3 years ago

I wonder if this is going to have something to do with statement caching and increased mutex contention. Just throwing things out there.

jordanlewis commented 3 years ago

cc @rafiss

rafiss commented 3 years ago

https://github.com/cockroachdb/cockroach/pull/68608 contains a change that makes all statements get prepared and cached no matter what.

This "prepare and cache all statements" behavior was disabled in https://github.com/cockroachdb/cockroach/pull/69313/commits/96c260f87546372603634c904e349646d5d56738 on August 24.

Also, there was a bug in the prepare logic that was fixed in https://github.com/cockroachdb/cockroach/pull/69691/commits/60dd572e552137607a06465b5dd885e5184f1943 on September 1.

I'm not yet speculating about what the root cause is, just recounting what things have been changing in workload and when. So we didn't see any additional improvement from 96c260f87546372603634c904e349646d5d56738 or 60dd572e552137607a06465b5dd885e5184f1943?

stevendanna commented 3 years ago

> I'm not yet speculating about what the root cause is, just recounting what things have been changing in workload and when. So we didn't see any additional improvement from 96c260f or 60dd572?

I don't think so, but the data is pretty noisy compared to some of the other kv tests.

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

rafiss commented 1 year ago

Closing since this is stale and the throughput of this test has now recovered to 145k req/s.