cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: 2023/04/30 kv0/nodes=3/cpu=96 regression [optimizer_use_histograms in internal executor] #102954

Open kvoli opened 1 year ago

kvoli commented 1 year ago

kv0/nodes=3/cpu=96 had a serious (66%) regression in terms of ops/s since the Apr 30th run. The regression occurs on both GCE and AWS.

[screenshots]

Jira issue: CRDB-27754

blathers-crl[bot] commented 1 year ago

Hi @kvoli, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/test-eng

smg260 commented 1 year ago

Bisection results, PR https://github.com/cockroachdb/cockroach/pull/101486 (cc: @rafiss )

2023-05-09T16:46:45Z   Bisecting regression in [kv0/enc=false/nodes=3/cpu=96] using commit range [4619767bb0b (known good),7d02e061759 (known bad)]
2023-05-09T16:46:45Z   Thresholds [good >= 43000, bad <= 27000]
2023-05-09T17:04:12Z   [1ee30a1fe7e] median ops/s: [26590]. Auto marked as bad.
2023-05-09T17:33:17Z   [2f03212bd1f] median ops/s: [24438]. Auto marked as bad.
2023-05-09T18:02:09Z   [608f03a999a] median ops/s: [23428]. Auto marked as bad.
2023-05-09T18:32:17Z   [da48b19bcc2] median ops/s: [25457]. Auto marked as bad.
2023-05-09T18:32:17Z   Bisection complete. Suspect commit:
commit da48b19bcc2f962dda531d075502682b64326bf4
Author: Rafi Shamim <rafi@cockroachlabs.com>
Date:   Thu Apr 13 15:12:45 2023 -0400

    sql: use global defaults for internal session data rather than zero-values

    Release note: None
rafiss commented 1 year ago

Thanks for tracking that down. I was wondering if my PR would turn anything up. In case you didn't get a chance to look at it, the summary is that we were previously using zero-valued defaults for all session variables used in internal operations (jobs, leases, delegated queries for some builtin functions, etc.). After https://github.com/cockroachdb/cockroach/pull/101486, the default values for session variables are used instead.

One notable change is that previously, internal operations would use distsql=off but now they will use distsql=auto. Another one is distsql_workmem, which previously was unbounded, but now uses the default limit of 64 MiB.
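For reference, a quick way to eyeball the defaults in question from a SQL shell (a sketch only; SHOW reports the current session's values, which match the global defaults that internal operations now inherit, unless the session has overridden them):

SHOW distsql;          -- 'auto' is the global default
SHOW distsql_workmem;  -- '64 MiB' is the global default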

To track this down further, I'd be curious to see if there are any specific queries that are slower. Then we can look at their traces.

smg260 commented 1 year ago

@rafiss Thanks for the change overview. I might be of limited help here, but can you link me to any documentation on obtaining query stats for this particular workload during a roachtest run?

irfansharif commented 1 year ago

This regression also appears for kv0/enc=false/nodes=3/cpu=32: https://roachperf.crdb.dev/?filter=&view=kv0%2Fenc%3Dfalse%2Fnodes%3D3%2Fcpu%3D32&tab=aws.

[screenshot]
irfansharif commented 1 year ago

kv0-traces.zip

stmt-bundle-870525437835771905/ - slow trace
stmt-bundle-870526672146366465/ - fast trace

irfansharif commented 1 year ago

Partway through a full kv0 run, there's a step change in the latency distribution. It becomes bimodal, and then throughput suffers:

[screenshot]
irfansharif commented 1 year ago

Slow executions have an additional RTT in there:

     1.021ms      0.066ms                            event:kv/kvclient/kvcoord/range_iter.go:225 [n1,client=10.142.0.47:44224,user=root,txn=44ab155c] key: /Table/106/1/-5473934171196112927/0, desc: r1012:/Table/106/1/-54{82423937990601008-63995622232649408} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=14, sticky=9223372036.854775807,2147483647]
     1.047ms      0.026ms                            event:kv/kvclient/kvcoord/dist_sender.go:2136 [n1,client=10.142.0.47:44224,user=root,txn=44ab155c] r1012: sending batch 1 Put, 1 EndTxn to (n2,s2):2
     1.059ms      0.012ms                            event:rpc/nodedialer/nodedialer.go:166 [n1,client=10.142.0.47:44224,user=root,txn=44ab155c] sending request to 10.142.0.45:26257
     1.064ms      0.005ms                                === operation:/cockroach.roachpb.Internal/Batch _verbose:1 node:1 client:10.142.0.47:44224 user:root txn:44ab155c span.kind:client
    46.497ms     45.432ms                                === operation:/cockroach.roachpb.Internal/Batch _verbose:1 node:2 span.kind:server request:Put [/Table/106/1/-5473934171196112927/0,/Min), EndTxn(parallel commit) [/Table/106/1/-5473934171196112927/0], [txn: 44ab155c], [can-forward-ts]
    46.497ms      0.000ms                                [local proposal: {count: 1, duration 8ms, unfinished}]
    47.778ms      1.282ms                                event:server/node.go:1167 [n2] node received request: 1 Put, 1 EndTxn
    52.347ms      4.569ms                                event:kv/kvserver/store_send.go:154 [n2,s2] executing Put [/Table/106/1/-5473934171196112927/0,/Min), EndTxn(parallel commit) [/Table/106/1/-5473934171196112927/0], [txn: 44ab155c], [can-forward-ts]
    52.378ms      0.031ms                                event:kv/kvserver/replica_send.go:185 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…}] read-write path
    52.405ms      0.027ms                                event:kv/kvserver/concurrency/concurrency_manager.go:195 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…}] sequencing request
    52.549ms      0.144ms                                event:kv/kvserver/concurrency/concurrency_manager.go:276 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…}] acquiring latches
    53.084ms      0.535ms                                event:kv/kvserver/concurrency/concurrency_manager.go:320 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…}] scanning lock table for conflicting locks
    53.511ms      0.428ms                                event:kv/kvserver/replica_write.go:164 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…}] applied timestamp cache
    54.662ms      1.150ms                                event:kv/kvserver/replica_write.go:403 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…}] executing read-write batch
    55.204ms      0.542ms                                event:kv/kvserver/replica_write.go:493 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…}] attempting 1PC execution
    57.949ms      2.746ms                                event:kv/kvserver/replica_evaluate.go:496 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…}] evaluated Put command header:<kvnemesis_seq:<> key:"\362\211\200\264\010\255\337\231\373\277\341\210" sequence:1 > value:<raw_bytes:"\\e\317s\n&\001\230" timestamp:<> > , txn=<nil> : resp=header:<> , err=<nil>
    59.905ms      1.955ms                                event:kv/kvserver/replica_proposal.go:793 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…}] need consensus on write batch with op count=1
    61.162ms      1.257ms                                event:kv/kvserver/replica_raft.go:133 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…}] evaluated request
    62.214ms      1.052ms                                event:kv/kvserver/replica_raft.go:177 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…}] proposing command to write 1 new keys, 1 new values, 0 new intents, write batch size=47 bytes
    63.098ms      0.884ms                                event:kv/kvserver/replica_raft.go:285 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…}] acquiring proposal quota (182 bytes)
    63.887ms      0.789ms                                event:kv/kvserver/replica_raft.go:570 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…}] submitting proposal to proposal buffer
    64.070ms      0.182ms                                event:kv/kvserver/replica_proposal_buf.go:561 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…}] flushing proposal to Raft
   128.259ms     64.189ms                                    === operation:local proposal _verbose:1 node:2 store:2 range:1012/2:/Table/106/1/-54{8242…-6399…} raft:
   128.323ms      0.065ms                                    event:kv/kvserver/replica_application_cmd.go:145 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…},raft] ack-ing replication success to the client; application will continue async w.r.t. the client
   129.029ms      0.706ms                                    event:kv/kvserver/app_batch.go:121 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…},raft] applying command
   133.523ms      4.494ms                                    event:kv/kvserver/replica_application_state_machine.go:176 [n2,s2,r1012/2:/Table/106/1/-54{8242…-6399…},raft] LocalResult (reply: (err: <nil>), *kvpb.PutResponse, *kvpb.EndTxnResponse, #encountered intents: 0, #acquired locks: 0, #resolved locks: 1#updated txns: 1 #end txns: 0, GossipFirstRange:false MaybeGossipSystemConfig:false MaybeGossipSystemConfigIfHaveFailure:false MaybeAddToSplitQueue:false MaybeGossipNodeLiveness:<nil>

Fast ones don't:

     3.447ms      0.012ms                            === operation:dist sender send _verbose:1 node:1 client:10.142.0.47:45386 user:root txn:e46bb31a
     3.447ms      0.000ms                            [/cockroach.roachpb.Internal/Batch: {count: 1, duration 7ms}]
     3.447ms      0.000ms                            [local proposal: {count: 1, duration 191µs, unfinished}]
     3.511ms      0.063ms                            event:kv/kvclient/kvcoord/range_iter.go:188 [n1,client=10.142.0.47:45386,user=root,txn=e46bb31a] querying next range at /Table/106/1/-5349021339240408575/0
     3.648ms      0.137ms                            event:kv/kvclient/kvcoord/range_iter.go:225 [n1,client=10.142.0.47:45386,user=root,txn=e46bb31a] key: /Table/106/1/-5349021339240408575/0, desc: r948:/Table/106/1/-53{53425727684939808-34997411926988208} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=14, sticky=9223372036.854775807,2147483647]
     3.711ms      0.063ms                            event:kv/kvclient/kvcoord/dist_sender.go:2136 [n1,client=10.142.0.47:45386,user=root,txn=e46bb31a] r948: sending batch 1 Put, 1 EndTxn to (n1,s1):1
     3.733ms      0.021ms                            event:rpc/nodedialer/nodedialer.go:157 [n1,client=10.142.0.47:45386,user=root,txn=e46bb31a] sending request to local client
     3.740ms      0.007ms                                === operation:/cockroach.roachpb.Internal/Batch _verbose:1 node:1 span.kind:server request:Put [/Table/106/1/-5349021339240408575/0,/Min), EndTxn(parallel commit) [/Table/106/1/-5349021339240408575/0], [txn: e46bb31a], [can-forward-ts]
     3.740ms      0.000ms                                [local proposal: {count: 1, duration 191µs, unfinished}]
     3.762ms      0.022ms                                event:server/node.go:1167 [n1] node received request: 1 Put, 1 EndTxn
     5.205ms      1.442ms                                event:kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller.go:157 [n1] admitted request (pri=normal-pri stream=t1/s1 tokens=+16 MiB wait-duration=1.283583ms mode=apply_to_all)
     5.295ms      0.091ms                                event:kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller.go:157 [n1] admitted request (pri=normal-pri stream=t1/s2 tokens=+16 MiB wait-duration=4.896µs mode=apply_to_all)
     6.148ms      0.853ms                                event:kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller.go:157 [n1] admitted request (pri=normal-pri stream=t1/s3 tokens=+16 MiB wait-duration=584.327µs mode=apply_to_all)
     6.293ms      0.145ms                                event:kv/kvserver/store_send.go:154 [n1,s1] executing Put [/Table/106/1/-5349021339240408575/0,/Min), EndTxn(parallel commit) [/Table/106/1/-5349021339240408575/0], [txn: e46bb31a], [can-forward-ts]
     6.321ms      0.028ms                                event:kv/kvserver/replica_send.go:185 [n1,s1,r948/1:/Table/106/1/-53{5342…-3499…}] read-write path
     6.356ms      0.035ms                                event:kv/kvserver/concurrency/concurrency_manager.go:195 [n1,s1,r948/1:/Table/106/1/-53{5342…-3499…}] sequencing request
     6.370ms      0.014ms                                event:kv/kvserver/concurrency/concurrency_manager.go:276 [n1,s1,r948/1:/Table/106/1/-53{5342…-3499…}] acquiring latches
     6.399ms      0.029ms                                event:kv/kvserver/concurrency/concurrency_manager.go:320 [n1,s1,r948/1:/Table/106/1/-53{5342…-3499…}] scanning lock table for conflicting locks
     6.429ms      0.030ms                                event:kv/kvserver/replica_write.go:164 [n1,s1,r948/1:/Table/106/1/-53{5342…-3499…}] applied timestamp cache
     6.450ms      0.021ms                                event:kv/kvserver/replica_write.go:403 [n1,s1,r948/1:/Table/106/1/-53{5342…-3499…}] executing read-write batch
     6.484ms      0.033ms                                event:kv/kvserver/replica_write.go:493 [n1,s1,r948/1:/Table/106/1/-53{5342…-3499…}] attempting 1PC execution
     6.781ms      0.298ms                                event:kv/kvserver/replica_evaluate.go:496 [n1,s1,r948/1:/Table/106/1/-53{5342…-3499…}] evaluated Put command header:<kvnemesis_seq:<> key:"\362\211\200\265\304un\252\227Z\001\210" sequence:1 > value:<raw_bytes:"\tj/\212\n&\001u" timestamp:<> > , txn=<nil> : resp=header:<> , err=<nil>
     6.821ms      0.039ms                                event:kv/kvserver/replica_proposal.go:793 [n1,s1,r948/1:/Table/106/1/-53{5342…-3499…}] need consensus on write batch with op count=1
     6.885ms      0.065ms                                event:kv/kvserver/replica_raft.go:133 [n1,s1,r948/1:/Table/106/1/-53{5342…-3499…}] evaluated request
     6.912ms      0.026ms                                event:kv/kvserver/replica_raft.go:177 [n1,s1,r948/1:/Table/106/1/-53{5342…-3499…}] proposing command to write 1 new keys, 1 new values, 0 new intents, write batch size=47 bytes
     6.930ms      0.019ms                                event:kv/kvserver/replica_raft.go:285 [n1,s1,r948/1:/Table/106/1/-53{5342…-3499…}] acquiring proposal quota (183 bytes)
     6.962ms      0.032ms                                event:kv/kvserver/replica_raft.go:570 [n1,s1,r948/1:/Table/106/1/-53{5342…-3499…}] submitting proposal to proposal buffer
     7.001ms      0.039ms                                event:kv/kvserver/replica_proposal_buf.go:561 [n1,s1,r948/1:/Table/106/1/-53{5342…-3499…}] flushing proposal to Raft
    10.798ms      3.285ms                                    === operation:local proposal _unfinished:1 _verbose:1 node:1 store:1 range:948/1:/Table/106/1/-53{5342…-3499…} raft:
    10.847ms      0.049ms                                    event:kv/kvserver/replica_application_cmd.go:145 [n1,s1,r948/1:/Table/106/1/-53{5342…-3499…},raft] ack-ing replication success to the client; application will continue async w.r.t. the client
    11.299ms      8.280ms            event:component:<flow_id:<555bddd9-1132-43c3-a05b-c20341437e66> type:PROCESSOR id:1 sql_instance_id:1 > net_rx:<latency:<> wait_time:<> deserialization_time:<> tuples_received:<> bytes_received:<> messages_received:<> > net_tx:<tuples_sent:<> bytes_sent:<> messages_sent:<> > kv:<bytes_read:<> tuples_read:<> batch_requests_issued:<> kv_time:<> contention_time:<> num_interface_steps:<> num_internal_steps:<> num_interface_seeks:<> num_internal_seeks:<> block_bytes:<> block_bytes_in_cache:<> key_bytes:<> value_bytes:<> point_count:<> points_covered_by_range_tombstones:<> range_key_count:<> range_key_contained_points:<> range_key_skipped_points:<> kv_cpu_time:<> num_gets:<> num_scans:<> num_reverse_scans:<> > exec:<exec_time:<value_plus_one:<nanos:8040854 > > max_allocated_mem:<> max_allocated_disk:<> consumed_ru:<> cpu_time:<value_plus_one:<nanos:1997745 > > > output:<num_batches:<value_plus_one:2 > num_tuples:<value_plus_one:2 > > flow_stats:<max_mem_usage:<> max_disk_usage:<> consumed_ru:<> >
irfansharif commented 1 year ago

Another one is distsql_workmem, which previously was unbounded, but now uses the default limit of 64 MiB.

It wasn't previously unbounded, I don't think:

https://github.com/cockroachdb/cockroach/blob/84ed8acb1eeb745828368da782f2900b43bd01ad/pkg/sql/execinfra/server_config.go#L340-L343

irfansharif commented 1 year ago

Here's another slow trace that doesn't straddle multiple nodes. But it's still the higher KV execution latencies (around proposal replication/acknowledgement) that are affecting top-line throughput.

[screenshots]

Here's a fast trace that straddles two nodes (captured in the first 5 minutes of kv0/nodes=3/cpu=96, when throughput is still high and latencies are low) that also has low replication latency. Why the step change a few minutes in?

[screenshot]
irfansharif commented 1 year ago

I also kicked off two manual runs using 1ee30a1fe7e (which includes the merge commit from https://github.com/cockroachdb/cockroach/pull/101486) and 2c9bdbbad1b (the merge commit just before it). The bisection is indeed correct, though I don't understand why this change causes the regression we're seeing.

irfansharif commented 1 year ago

Two runs of kv0/nodes=3/cpu=96 from a recent master and then a master with all commits from https://github.com/cockroachdb/cockroach/pull/101486 reverted, just to sanity-check that it's really that PR that's responsible.

[screenshot]
irfansharif commented 1 year ago

Of the commits in that PR, it's https://github.com/cockroachdb/cockroach/commit/da48b19bcc2f962dda531d075502682b64326bf4 in particular where the regression appears. I confirmed this by isolating that SHA (reverting everything else from master and re-running the kv0 experiments; the degradation persisted).

irfansharif commented 1 year ago

I looked through the debug zips for clues in the logs around when one of these step changes happened. I saw things like:

I230604 16:01:34.939333 5256528 sql/stats/automatic_stats.go:833 ⋮ [T1,n3] 424 automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230604 16:02:34.945126 5721242 sql/stats/automatic_stats.go:833 ⋮ [T1,n3] 425 automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230604 16:03:34.952456 6151338 sql/stats/automatic_stats.go:833 ⋮ [T1,n3] 426 automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230604 16:04:34.960182 6524605 sql/stats/automatic_stats.go:833 ⋮ [T1,n3] 427 automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230604 16:05:27.405226 96 util/admission/work_queue.go:512 ⋮ [T1,n3] 428 ‹sql-kv-response›: FIFO threshold for tenant 1 is -128
I230604 16:05:27.505068 94 util/admission/work_queue.go:512 ⋮ [T1,n3] 429 ‹kv›: FIFO threshold for tenant 1 is -128
I230604 16:05:27.605032 395 util/admission/work_queue.go:512 ⋮ [T1,n3] 430 ‹kv›: FIFO threshold for tenant 1 is -128
I230604 16:05:35.038644 6855831 sql/stats/automatic_stats.go:833 ⋮ [T1,n3] 431 automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230604 16:06:35.044587 7161519 sql/stats/automatic_stats.go:833 ⋮ [T1,n3] 432 automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230604 16:07:35.051411 7464761 sql/stats/automatic_stats.go:833 ⋮ [T1,n3] 433 automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230604 16:08:35.058832 7760751 sql/stats/automatic_stats.go:833 ⋮ [T1,n3] 434 automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230604 16:08:35.067411 7910897 jobs/registry.go:617 ⋮ [T1,n3,intExec=‹create-stats›] 435 active version is 23.1-6

Table ID 106 corresponds to kv.kv, the table this workload is upserting into. I next tried disabling the automatic stats collection mechanism:

SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false;

It's made all the difference. Here are two runs of kv0/nodes=3/cpu=96 with and without automatic SQL stats collection.

[screenshot]

I did the same at https://github.com/cockroachdb/cockroach/commit/1ee30a1fe7e56814a37c22457078c3a6bd0ede27, but running kv0/enc=false/nodes=3/cpu=32. Here are the runs with and without automatic SQL stats collection, respectively.

[screenshots]

(All the pink grafana annotations correspond to jobs running, including the auto stats one.)

irfansharif commented 1 year ago

stmt-bundle-871053367241342979.zip is the statement bundle for the trace in https://github.com/cockroachdb/cockroach/issues/102954#issuecomment-1575629949. I can look further (I'm generally interested in understanding how the PR above + automatic table statistics could cause an increase in what looks like KV replication latencies), but I'd like some pointers from @cockroachdb/sql-queries first. It's a simple UPSERT INTO kv (k, v) VALUES ($1:::INT8, $2:::BYTES) into the following schema:

CREATE TABLE public.kv (
    k INT8 NOT NULL,
    v BYTES NOT NULL,
    CONSTRAINT kv_pkey PRIMARY KEY (k ASC)
);
rafiss commented 1 year ago

The PR (https://github.com/cockroachdb/cockroach/pull/101486) that came up in the regression affects how the internal executor inherits session variables. Here is a list of session variables for which the internal executor previously would default to false, but now defaults to true.

Here are a few more that seem worth calling out.

mgartner commented 1 year ago

@rafiss does the internal executor always use the hard-coded global defaults now, or does it inherit from the root/admin role's defaults? I'd like to change the settings without building a new binary, if possible.

rafiss commented 1 year ago

If the internal executor is running a query delegated from a user's session, it will inherit settings from that session. This behavior is not new. An example of that is any usage of planner.QueryIteratorEx and related functions.

If the internal executor is not related to any parent session, like if it's used by a background job, then it uses the global defaults as specified in vars.go. This is the new behavior. Previously, it would just use the Go zero-value.

One final thing to note is that there are some settings that are always explicitly set by any usage of the internal executor. See https://github.com/cockroachdb/cockroach/blob/ce44bdd8e8feae686621c9b1d8d65dce9a286e10/pkg/sql/internal.go#L772-L790

mgartner commented 1 year ago

I attempted to binary search for the setting causing problems, and I think it might be optimizer_use_forecasts. I need to run some more tests to verify that, though, because the reproduction was not very consistent.

michae2 commented 1 year ago

So is the theory that optimizer_use_forecasts is changing the plan of an internal query used during automatic stats collection? I believe these are all of the internal queries used during auto stats collection:

mgartner commented 1 year ago

No, I don't have any reason to believe that the work of collecting stats is the problem. Rather, the presence of table stats and one or more of these optimizer settings alters the query plans of this workload. The forecasting setting fits the hypothesis nicely because forecasts require both the setting to be enabled and several stats collections. Disabling either the setting or stats would prevent forecasting. I'm working on verifying whether it is this setting now.

mgartner commented 1 year ago

I think I was incorrect about optimizer_use_forecasts being the issue. I've narrowed it down to one of these:

optimizer_use_histograms
optimizer_use_multicol_stats
optimizer_always_use_histograms
optimizer_use_improved_computed_column_filters_derivation
streamer_enabled
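For reference, a sketch of how to check the current session-level values of these candidates (only a sanity check; the regression concerns the internal executor's defaults, not user sessions):

SHOW optimizer_use_histograms;
SHOW optimizer_use_multicol_stats;
SHOW optimizer_always_use_histograms;
SHOW optimizer_use_improved_computed_column_filters_derivation;
SHOW streamer_enabled;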
DrewKimball commented 1 year ago

The statement from the bundle above doesn't have much room for plan changes. What queries from kv0 would we expect to change with different stats? The streamer_enabled setting seems like the odd one out here.

mgartner commented 1 year ago

I've determined that the regression is caused by enabling optimizer_use_histograms. I'll have a PR up shortly to disable it for the internal executor and the roachtest performance should improve. I'll continue to investigate the root cause.

mgartner commented 1 year ago

The statement from the bundle above doesn't have much room for plan changes. What queries from kv0 would we expect to change with different stats?

You're absolutely right. The only type of statement executed in this workload is an UPSERT, e.g.:

UPSERT INTO kv(k, v) VALUES (-2503149175035467203, '\x85');

AFAICT, there's really only a single query plan possible. I observed the cluster with optimizer_use_histograms on and off, and the query plans for the UPSERT look the same.
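A minimal sketch of that comparison (the literal values below are just one example row from the workload):

SET optimizer_use_histograms = true;   -- repeat with 'false'
EXPLAIN UPSERT INTO kv(k, v) VALUES (-2503149175035467203, '\x85');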

So the question is, how does a setting that in theory only affects the optimizer have an effect on query throughput when the plans generated are the same?

CPU utilization hovers between 20-25% in the slow case (optimizer_use_histograms=on) and 30% in the fast case (optimizer_use_histograms=off):

[screenshot]

We see a similar pattern for disk writes:

[screenshot]

And for network usage:

[screenshot]

So where is the bottleneck?

Memory usage is slightly higher with histograms enabled, but not egregiously so.

[screenshot]

This is shaping up to be a classic whodunit. I'll search for more clues tomorrow.

DrewKimball commented 1 year ago

Is kv a closed-loop workload? If that's the case, maybe histogram processing in the optimizer adding latency to each statement could explain a drop in throughput without much increase in resource usage.

michae2 commented 1 year ago

One idea is to set sql.log.slow_query.internal_queries.enabled and sql.log.slow_query.latency_threshold and see if any internal queries show up in the SQL_INTERNAL_PERF logging channel.
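A sketch of those two knobs (setting names from the suggestion above; the 100ms threshold is an arbitrary example):

SET CLUSTER SETTING sql.log.slow_query.internal_queries.enabled = true;
SET CLUSTER SETTING sql.log.slow_query.latency_threshold = '100ms';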

irfansharif commented 1 year ago

CPU utilization hovers between 20-25% in the slow case (optimizer_use_histograms=on) and 30% in the fast case (optimizer_use_histograms=off):

Isn't that just because of reduced foreground throughput in the slow case? (That was my naive understanding -- we could -diff_base CPU profiles to confirm.)

Is kv a closed-loop workload? If that's the case, maybe histogram processing in the optimizer adding latency to each statement could explain a drop in throughput without much increase in resource usage.

What's confusing to me is where the latency seems to be added. See traces here: https://github.com/cockroachdb/cockroach/issues/102954#issuecomment-1575629949. It seems to be added during synchronous KV-level replication. So I'm quite confused. optimizer_use_histograms being implicated somehow has only added to the confusion. But for your question, yes, if there's a significant increase in E2E latency (like we see here) we expect a corresponding drop in kv0 throughput. It doesn't have to be a resource bottleneck.

rafiss commented 1 year ago

The sql.stats.response.show_internal.enabled cluster setting could also be turned on so that we can see internal queries in the DB console statement activity.
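A sketch of that toggle:

SET CLUSTER SETTING sql.stats.response.show_internal.enabled = true;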

mgartner commented 1 year ago

Slight correction to my previous line of thinking: the optimizer_use_histograms setting I changed does not affect user queries, only queries performed via the internal executor. So the UPSERT query plans wouldn't be affected, even if there were multiple possible plans for them.

mgartner commented 1 year ago

Ran a test that I believe proves that auto-stats collection is not affecting foreground queries. I started the workload with auto-stats and histograms enabled. After throughput tanked, I disabled auto-stats collection and observed that throughput did not improve. I think this confirms that it's the presence of collected stats and usage of the histograms that leads to the problem.

[screenshot, 2023-06-07 16:45]
irfansharif commented 1 year ago

Rotating out of L2, re-assigning to @pavelkalinnikov. @nicktrav, it might also make sense to pull this out of the L2 rotation at this point, but I'll defer to you.

mgartner commented 1 year ago

I've narrowed the problem down to the table statistics collected for the system.protected_ts_meta and system.role_options tables (table IDs 31 and 33, respectively). When the stats collected for those tables are deleted, query latency improves:

[screenshot]

Interestingly, the stats for both tables are empty. For example, the stats for system.protected_ts_meta:

root@localhost:26257/defaultdb> select "tableID", "statisticID", "createdAt", "rowCount" from system.table_statistics where "tableID" >= 30 AND "tableID" < 40;
  tableID |    statisticID     |         createdAt         | rowCount
----------+--------------------+---------------------------+-----------
       31 | 872029099613716481 | 2023-06-08 02:49:21.74417 |        0
       31 | 872029099625775105 | 2023-06-08 02:49:21.74417 |        0
       31 | 872029099642716161 | 2023-06-08 02:49:21.74417 |        0
       31 | 872029099661721601 | 2023-06-08 02:49:21.74417 |        0
       31 | 872029099678105601 | 2023-06-08 02:49:21.74417 |        0
(5 rows)

root@localhost:26257/defaultdb> show histogram 872029099625775105;
SHOW HISTOGRAM 0

root@localhost:26257/defaultdb> show histogram 872029099613716481;
SHOW HISTOGRAM 0

root@localhost:26257/defaultdb> show histogram 872029099642716161;
SHOW HISTOGRAM 0

root@localhost:26257/defaultdb> show histogram 872029099661721601;
SHOW HISTOGRAM 0

root@localhost:26257/defaultdb> show histogram 872029099678105601;
SHOW HISTOGRAM 0

root@localhost:26257/defaultdb> show statistics using json for table system.protected_ts_meta;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   statistics
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[{"avg_size": 0, "columns": ["singleton"], "created_at": "2023-06-08 03:22:18.764739", "distinct_count": 0, "histo_col_type": "BOOL", "histo_version": 2, "name": "__auto__", "null_count": 0, "row_count": 0}, {"avg_size": 0, "columns": ["version"], "created_at": "2023-06-08 03:22:18.764739", "distinct_count": 0, "histo_col_type": "INT8", "histo_version": 2, "name": "__auto__", "null_count": 0, "row_count": 0}, {"avg_size": 0, "columns": ["num_records"], "created_at": "2023-06-08 03:22:18.764739", "distinct_count": 0, "histo_col_type": "INT8", "histo_version": 2, "name": "__auto__", "null_count": 0, "row_count": 0}, {"avg_size": 0, "columns": ["num_spans"], "created_at": "2023-06-08 03:22:18.764739", "distinct_count": 0, "histo_col_type": "INT8", "histo_version": 2, "name": "__auto__", "null_count": 0, "row_count": 0}, {"avg_size": 0, "columns": ["total_bytes"], "created_at": "2023-06-08 03:22:18.764739", "distinct_count": 0, "histo_col_type": "INT8", "histo_version": 2, "name": "__auto__", "null_count": 0, "row_count": 0}]
mgartner commented 1 year ago

I'm not sure if this is a symptom or a cause, but I noticed that the decrease in SQL QPS corresponds to an increase in consistency queue processing latency:

[screenshot]
mgartner commented 1 year ago

I've narrowed the problem down to the table statistics collected for the system.protected_ts_meta and system.role_options tables (table IDs 31 and 33, respectively). When the stats collected for those tables are deleted, query latency improves:

@adityamaru @ecwall @Xiang-Gu If you can provide any context about how these tables would be accessed during a simple kv0 workload (UPSERTS into a single, simple table only), that would be appreciated!

rafiss commented 1 year ago

For the kv0 workload, system.role_options should only be getting accessed when establishing a new connection. The other usages are for privilege checks for DDL operations or cluster management operations, which I don't think this workload does.

michae2 commented 1 year ago

Interestingly, the stats for both tables are empty.

I bet with the stats we're now doing a full scan for some internal query that was using an index before.
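One way to test that theory would be to compare plans for the internal queries touching these tables before and after the (empty) stats exist; for example, with a hypothetical lookup shape (not a confirmed culprit query):

EXPLAIN SELECT username, option, value FROM system.role_options WHERE username = 'root';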

mgartner commented 1 year ago

I bet with the stats we're now doing a full scan for some internal query that was using an index before.

I agree, that sounds likely. I'm trying to track down the problematic query now.

adityamaru commented 1 year ago

If you can provide any context about how these tables would be accessed during a simple kv0 workload (UPSERTS into a single, simple table only)

Protected timestamps are only relevant for some schema changes, backups, changefeeds, and restores, so a simple workload of upserts without background backup schedules should not be writing any PTS records. Furthermore, protected_ts_meta only tracks metadata about the records, not the records themselves, which live in protected_ts_records. This metadata is only used to implement safeguards such as avoiding too many records in a cluster. So, tl;dr: it's not a very important table, especially in the context of this workload.

mgartner commented 1 year ago

I've narrowed the problem down to the table statistics collected for the system.protected_ts_meta and sytstem.role_options tables (table ID 31 and 33 respectively). When the stats collected for those tables are deleted, query latency improves:

This statement was true for several runs (about 10 or more) yesterday evening and a handful more this morning. But in two runs in a row just now, deleting these stats has had no effect, so it's possible this is a red herring. That theory is supported by the fact that, AFAICT, there are no queries involving role_options running on the cluster (as Rafi suggested), and only a single, unremarkable query on protected_ts_meta was executed occasionally.

mgartner commented 1 year ago

This top-down method of investigation doesn't appear to be making much progress. The presence of table histograms and their use by the optimizer appears to be required to reproduce this regression, but there is no plausible explanation for this yet. Moreover, the regression does not appear to depend on histograms for specific tables: as far as I can tell, it is resolved by deleting table statistics for different tables in different runs of the roachtest.

Based on my observations above (https://github.com/cockroachdb/cockroach/issues/102954#issuecomment-1579470277), I think that mutex contention could be the culprit. I compared the mutex profiles of nodes during good and poor performance, and the only thing that stood out to me was that in the poor performing case I observed time spent unlocking mutexes in SQL metrics code like ssmemstorage.Container.NewApplicationStatsWithInheritedOptions. However, I created a custom binary that disabled most of the SQL metrics logic and there was no performance improvement.

I think we should reverse our approach and start at the bottom. We know that ~1% of UPSERTs in this workload have 100+ms latencies, and that a large portion of that time is spent in kv (see https://github.com/cockroachdb/cockroach/issues/102954#issuecomment-1575629949). @pavelkalinnikov what tools does the KV team have to determine the source of increased latencies as seen in these bundles: https://github.com/cockroachdb/cockroach/issues/102954#issuecomment-1574216980?

srosenberg commented 1 year ago

Drive-by question/comment... given that the regression is not negligible (~66%), there has to be a large diff in the CPU profiles, no? I only see one other reference to CPU profiles, https://github.com/cockroachdb/cockroach/issues/102954#issuecomment-1579490893; is there really nothing that stands out in the two flamegraphs?

mgartner commented 1 year ago

As shown here, CPU utilization goes down when latency increases and throughput decreases.

srosenberg commented 1 year ago

As shown here, CPU utilization goes down when latency increases and throughput decreases.

I see the system metrics but not the flame graphs. Sometimes they tell different stories; could we share the flame graphs for posterity, even if they aren't very telling in this case?

mgartner commented 1 year ago

Sure. Here are profiles from each of the three nodes, taken while performance was good ("fast") and while it was poor ("slow"):

profiles.zip

cucaroach commented 1 year ago

I think the culprit here is something taking critical path goroutines off the cpu. We need a way to filter CPU profiles, mutex profiles, allocation profiles, etc. by request-path code (the main SQL goroutine and any dependent KV goroutines). It's impossible for the system to know which goroutines are on the critical path, so I think we need to tag them somehow. That way we can filter out everything that isn't on the critical path. Does the problem manifest with 1 node?

mgartner commented 1 year ago

I think the culprit here is something taking critical path goroutines off the cpu

What's an example of something that would do this?

Does the problem manifest with 1 node?

I'll try to verify this.

cucaroach commented 1 year ago

I think the culprit here is something taking critical path goroutines off the cpu

What's an example of something that would do this?

The prime examples would be mutexes, allocations leading to GC activity, goroutine/channel handoffs, and the scheduler doing what it does. The problem is that these events happen all the time and most of them are benign, so we need a way to zero in on when they happen on goroutines servicing the request.

cucaroach commented 1 year ago

So I tried to dig into this a little with the following approach: turn on sql.log.slow_query.latency_threshold, setting it to 100ms; take trace samples every 5s while the workload is running; find the goroutine ID for a slow query from the log; then filter the trace for that goroutine. This lets us iterate through the samples for that goroutine. It's a bit of an inexact science, but it seems to work. I'm still having trouble lining up the timestamps to know which trace samples correspond to my slow query instance, but in theory this approach should work; I just need to refine the technique a bit.