cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: 2023/04/30 kv0/nodes=3/cpu=96 regression [optimizer_use_histograms in internal executor] #102954

Open kvoli opened 1 year ago

kvoli commented 1 year ago

kv0/nodes=3/cpu=96 had a serious (66%) regression in terms of ops/s since the Apr 30th run. The regression occurs on both GCE and AWS.

[charts: kv0 ops/s over time for the GCE and AWS runs]

Jira issue: CRDB-27754

cucaroach commented 1 year ago

Hmm, did this bug resolve itself?

https://roachperf.crdb.dev/?filter=&view=kv0%2Fenc%3Dfalse%2Fnodes%3D3%2Fcpu%3D96&tab=aws https://roachperf.crdb.dev/?filter=&view=kv0%2Fenc%3Dfalse%2Fnodes%3D3%2Fcpu%3D96&tab=gce

irfansharif commented 1 year ago

No, that's just the result of https://github.com/cockroachdb/cockroach/pull/104443.

cucaroach commented 1 year ago

Oh good, the hunt is on!

cucaroach commented 1 year ago

Possibly related; what I saw looking into this had some shadows of what I see looking into COPY performance:

https://github.com/cockroachdb/cockroach/issues/102755#issuecomment-1598494535

mgartner commented 1 year ago

I have a hunch that this could be related to the problem described here: https://github.com/cockroachlabs/support/issues/2423#issuecomment-1645996297

A few things don't fully add up though, like why doesn't CPU usage increase in these benchmarks? I'll run some more tests and see if I can prove or disprove anything.

mgartner commented 1 year ago

I have a hunch that this could be related to the problem described here: https://github.com/cockroachlabs/support/issues/2423#issuecomment-1645996297

I ran a test and didn't see any relation to that issue, so nevermind.

cucaroach commented 1 year ago

Okay, getting somewhere: in order to repro the slow behavior, it's only necessary to enable optimizer histograms on the create stats job internal query launched here:

https://github.com/cockroachdb/cockroach/blob/release-23.1/pkg/sql/create_stats.go#L592

So I guess telling the optimizer to use histogram stats on the queries executed while we are creating table statistics causes the malfunction. But to repeat the findings above: there's nothing in the cpu/memory/mutex profiles that indicates what's going on; enabling the optimizer histograms just magically makes the latency on some of the KV requests go way up.

I'll keep zooming in...

srosenberg commented 1 year ago

We recently saw a regression due to the high cost of RWMutex for vCPUs >= 32 [1]. In this case, does the regression occur on cpu=32 or cpu=8? Or is it only happening on cpu=96?

[1] https://github.com/cockroachdb/cockroach/issues/109443

cucaroach commented 1 year ago

It occurs for 32 and 96; it doesn't appear to happen with 8. Marcus did some experiments to rule out the SQL stats.

rafiss commented 1 year ago

It may not be SQLStats, but could there be some other RWMutex that is used when histograms are enabled?

cucaroach commented 1 year ago

Okay, so I have run a bunch of experiments and I have some theories about what's going on here, but I could use some help from the KV folks nailing it down further.

Basically what happens is that somewhere between one and ten minutes after we let the internal executor use histograms in the optimizer, we experience a throughput collapse. This is very reliable and happens every time and never happens with histograms fully disabled. This diagram shows it clearly; I added a cluster setting to disable Marcus's workaround and flipped it at 14:42.

[screenshot, 2023-09-18 10:55 AM: SQL throughput before and after the setting flip at 14:42]
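
For reference, the flip at 14:42 corresponds to the following statement (it also appears in the n1 event log excerpted later in this thread):

SET CLUSTER SETTING "sql.internal.disable_optimizer_use_histograms.enabled" = false;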

The only thing I could find that correlates with this is raft replication latency:

[screenshot, 2023-09-18 10:54 AM: raft replication latency over the same window]

Here it is w/ total KV latency and broken out by node:

[screenshot, 2023-09-18 12:35 PM: total KV latency broken out by node]

So the raft replication latency doesn't explain all the KV latency but it appears to be the biggest contributor.

I will continue to focus on which internal queries change plans when this switch is flipped. So far it doesn't look like it's any particular one, and no internal queries appear to be contributing to resource consumption in any meaningful way. I have seen some plans doing full table scans where an index could be used, but those tables are tiny, so it doesn't make sense for that alone to collapse throughput. It definitely seems like a query-execution butterfly wing flap leads to a raft throughput collapse hurricane.

I have found that using the splt=0 variation of this test (i.e. no 1000 pre-splits) increases performance noticeably, as does doubling COCKROACH_SCHEDULER_SHARD_SIZE to 32, so there's definitely some throughput-limiting friction in the raft/replica area, but neither intervention staved off the throughput collapse. I didn't play with raft scheduler concurrency.

I don't think the mutex contention as reported by Go indicates anything meaningful; we do see some RWMutex contention, but it's not meaningfully different before/after the collapse. Ditto for memory allocations.

michae2 commented 1 year ago

Nice investigation, @cucaroach!

This is very reliable and happens every time and never happens with histograms fully disabled.

Were histograms enabled for internal executor queries for 23.1, or only for 23.2? If only for 23.2 then we could disable them until we find the cause of the regression.

The only thing I could find that correlates with this is raft replication latency:

This makes me think an internal mutation query has started writing to more rows. Is there a corresponding increase in any of these metrics?

distsender.rpc.put.sent
distsender.rpc.initput.sent
distsender.rpc.conditionalput.sent
distsender.rpc.deleterange.sent
distsender.rpc.delete.sent
distsender.rpc.get.sent
distsender.rpc.scan.sent
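
Something along these lines against crdb_internal.node_metrics would give a point-in-time view of those counters on a given node (just a sketch; adjust to however you're already pulling metrics):

-- Point-in-time values of the DistSender RPC counters listed above,
-- reported by whichever node the query runs against.
SELECT name, value
FROM crdb_internal.node_metrics
WHERE name IN (
  'distsender.rpc.put.sent',
  'distsender.rpc.initput.sent',
  'distsender.rpc.conditionalput.sent',
  'distsender.rpc.deleterange.sent',
  'distsender.rpc.delete.sent',
  'distsender.rpc.get.sent',
  'distsender.rpc.scan.sent'
)
ORDER BY name;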

I will continue to focus on which internal queries change plans when this switch is flipped. So far it doesn't look like it's any particular one, and no internal queries appear to be contributing to resource consumption in any meaningful way. I have seen some plans doing full table scans where an index could be used, but those tables are tiny, so it doesn't make sense for that alone to collapse throughput. It definitely seems like a query-execution butterfly wing flap leads to a raft throughput collapse hurricane.

What method are you using to find changed plans? One thing I like to do is look at the createdAt time of statistics in system.table_statistics, as usually stats are the reason plans change (maybe you're already doing this). Is there one particular table that always has a stats collection just before the beginning of the increase in raft latency?

cucaroach commented 1 year ago

If anyone wants to poke around: http://treilly-32-0001.roachprod.crdb.io:26258/#/metrics/overview/cluster?start=1695047430&end=1695051000

cucaroach commented 1 year ago

Were histograms enabled for internal executor queries for 23.1, or only for 23.2? If only for 23.2 then we could disable them until we find the cause of the regression.

This was already done but we wanted to remove that hack for 23.2.

This makes me think an internal mutation query has started writing to more rows. Is there a corresponding increase in any of these metrics?

No, the puts done by the KV workload fall off and the other stuff is noise.

[screenshot, 2023-09-18 5:54 PM: DistSender RPC metrics over the collapse window]

What method are you using to find changed plans? One thing I like to do is look at the createdAt time of statistics in system.table_statistics, as usually stats are the reason plans change (maybe you're already doing this). Is there one particular table that always has a stats collection just before the beginning of the increase in raft latency?

I hacked in internal executor SQL logging so I can get timestamped logs of internal query executions, and then I run queries against crdb_internal.statement_statistics to pull out the planGists for the queries I see running around the time the throughput collapse occurs.
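
Roughly, the query I'm running looks like the sketch below (the JSON paths into the statistics column are from memory and may need adjusting for this build):

-- Plan gists for internal statements, newest aggregation buckets first.
SELECT
  aggregated_ts,
  app_name,
  metadata->>'query' AS query,
  statistics->'statistics'->'planGists' AS plan_gists
FROM crdb_internal.statement_statistics
WHERE app_name LIKE '$ internal%'
ORDER BY aggregated_ts DESC;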

Mostly we just generate stats for the KV table, although I see occasional stats being generated for transaction_statistics. Note that in both the good and bad cases we are collecting and saving table stats and histograms. The only difference is that in the bad case the optimizer uses them for internal queries.

DrewKimball commented 1 year ago

Write IOPs appear to increase somewhat at around the right time. That seems interesting, considering the SQL ops/s drops. Do you know what the provisioned IOPs are for these machines? It looks like it reaches around 10-11k after the change.

[screenshot, 2023-09-18 4:39 PM: disk write IOPs over the same window]
michae2 commented 1 year ago

Thanks for the response. I looked at system.table_statistics, but no stats were created just before the performance degradation at 14:45:

root@localhost:26257/system/defaultdb> SELECT DISTINCT "tableID", name, "createdAt" FROM system.table_statistics WHERE "createdAt" > '2023-09-18 00:00' ORDER BY "createdAt";
  tableID |   name   |         createdAt
----------+----------+-----------------------------
...
       27 | __auto__ | 2023-09-18 14:31:51.521065
       41 | __auto__ | 2023-09-18 14:31:51.721491
       35 | __auto__ | 2023-09-18 14:32:17.674131
       52 | __auto__ | 2023-09-18 14:32:40.474798
      106 | __auto__ | 2023-09-18 14:53:51.746017
      106 | __auto__ | 2023-09-18 14:56:51.037568
...

I also looked at jobs, but again nothing that matches the timing:

root@localhost:26257/system/defaultdb> SELECT * FROM system.jobs WHERE created > '2023-09-18 00:00' ORDER BY created;
          id         |  status   |          created           | payload | progress | created_by_type |   created_by_id    |             claim_session_id             | claim_instance_id | num_runs |          last_run          |            job_type
---------------------+-----------+----------------------------+---------+----------+-----------------+--------------------+------------------------------------------+-------------------+----------+----------------------------+----------------------------------
...
  901046877327687683 | succeeded | 2023-09-18 14:41:24.731369 | NULL    | NULL     | NULL            |               NULL | \x0101805d8ef89dd97c42a8ab52454b91642f01 |                 3 |        1 | 2023-09-18 14:41:24.782088 | AUTO CREATE STATS
  901046948223025153 | succeeded | 2023-09-18 14:41:46.36077  | NULL    | NULL     | NULL            |               NULL | \x01018045100fa4cd884a53ab5d429d743d24dc |                 2 |        1 | 2023-09-18 14:41:51.327912 | SCHEMA CHANGE GC
  901046948253073409 | succeeded | 2023-09-18 14:41:46.36077  | NULL    | NULL     | NULL            |               NULL | \x01018088a283b49ca741a498716c355b77bcab |                 1 |        1 | 2023-09-18 14:41:46.437448 | SCHEMA CHANGE
  901046948747083777 | succeeded | 2023-09-18 14:41:46.522472 | NULL    | NULL     | NULL            |               NULL | \x01018045100fa4cd884a53ab5d429d743d24dc |                 2 |        1 | 2023-09-18 14:41:51.32455  | SCHEMA CHANGE GC
  901046948776312833 | succeeded | 2023-09-18 14:41:46.522472 | NULL    | NULL     | NULL            |               NULL | \x01018088a283b49ca741a498716c355b77bcab |                 1 |        1 | 2023-09-18 14:41:46.590447 | SCHEMA CHANGE
  901046949218254849 | succeeded | 2023-09-18 14:41:46.66744  | NULL    | NULL     | NULL            |               NULL | \x01018045100fa4cd884a53ab5d429d743d24dc |                 2 |        1 | 2023-09-18 14:41:51.328001 | SCHEMA CHANGE GC
  901046949247942657 | succeeded | 2023-09-18 14:41:46.66744  | NULL    | NULL     | NULL            |               NULL | \x01018088a283b49ca741a498716c355b77bcab |                 1 |        1 | 2023-09-18 14:41:46.731614 | SCHEMA CHANGE
  901046949713641473 | succeeded | 2023-09-18 14:41:46.819521 | NULL    | NULL     | NULL            |               NULL | \x01018045100fa4cd884a53ab5d429d743d24dc |                 2 |        1 | 2023-09-18 14:41:51.328083 | SCHEMA CHANGE GC
  901046949746147329 | succeeded | 2023-09-18 14:41:46.819521 | NULL    | NULL     | NULL            |               NULL | \x01018088a283b49ca741a498716c355b77bcab |                 1 |        1 | 2023-09-18 14:41:46.879064 | SCHEMA CHANGE
  901048308970815490 | succeeded | 2023-09-18 14:48:41.634174 | NULL    | NULL     | NULL            |               NULL | \x01018045100fa4cd884a53ab5d429d743d24dc |                 2 |        1 | 2023-09-18 14:48:41.735556 | AUTO CREATE STATS
  901048672790806530 | succeeded | 2023-09-18 14:50:32.663231 | NULL    | NULL     | NULL            |               NULL | \x01018045100fa4cd884a53ab5d429d743d24dc |                 2 |        1 | 2023-09-18 14:50:32.730955 | AUTO CREATE STATS
  901048674668216322 | succeeded | 2023-09-18 14:50:33.236169 | NULL    | NULL     | NULL            |               NULL | \x01018045100fa4cd884a53ab5d429d743d24dc |                 2 |        1 | 2023-09-18 14:50:33.348989 | AUTO CREATE STATS
  901048698456866819 | succeeded | 2023-09-18 14:50:40.495888 | NULL    | NULL     | NULL            |               NULL | \x0101805d8ef89dd97c42a8ab52454b91642f01 |                 3 |        1 | 2023-09-18 14:50:40.538714 | AUTO CREATE STATS
  901049492025999361 | succeeded | 2023-09-18 14:54:42.674084 | NULL    | NULL     | NULL            |               NULL | \x01018088a283b49ca741a498716c355b77bcab |                 1 |        1 | 2023-09-18 14:54:42.910066 | AUTO CREATE STATS
  901049913960300545 | succeeded | 2023-09-18 14:56:51.438099 | NULL    | NULL     | NULL            |               NULL | \x01018088a283b49ca741a498716c355b77bcab |                 1 |        1 | 2023-09-18 14:56:51.488154 | AUTO CREATE STATS
  901049914488684545 | succeeded | 2023-09-18 14:56:51.59935  | NULL    | NULL     | NULL            |               NULL | \x01018088a283b49ca741a498716c355b77bcab |                 1 |        1 | 2023-09-18 14:56:51.64317  | AUTO CREATE STATS
...

Also looked at logs:

I230918 14:42:20.236365 6664475 util/log/event_log.go:32 ⋮ [T1,Vsystem,n1,client=127.0.0.1:37544,hostnossl,user=root] 669 ={"Timestamp":1695048140231031694,"EventType":"set_cluster_setting","Statement":"SET CLUSTER SETTING \"sql.internal.disable_optimizer_use_histograms.enabled\" = false","Tag":"SET CLUSTER SETTING","User":"root","ApplicationName":"$ cockroach sql","SettingName":"sql.internal.disable_optimizer_use_histogram>
I230918 14:42:41.962897 6587779 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n1] 670  automatically executing ‹"CREATE STATISTICS __auto__ FROM [12] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:42:41.970946 6587779 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n1] 671  automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:43:41.977480 7140863 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n1] 672  automatically executing ‹"CREATE STATISTICS __auto__ FROM [12] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:43:41.988749 7140863 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n1] 673  automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:43:42.004626 7140863 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n1] 674  automatically executing ‹"CREATE STATISTICS __auto__ FROM [28] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:43:42.506704 7171788 github.com/google/pprof/internal/driver/webui.go:140 ⋮ [pprof] 675  ‹Serving web UI on http://localhost:0›
I230918 14:43:58.057615 1108228 github.com/google/pprof/internal/driver/webui.go:140 ⋮ [pprof] 676  ‹Serving web UI on http://localhost:0›
I230918 14:44:13.197410 712 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n1,s1] 677  IO overload: compaction score 0.050 (3 ssts, 1 sub-levels), L0 growth 69 MiB (write 69 MiB ingest 0 B ignored 0 B): requests 577448 (1054 bypassed) with 72 MiB acc-write (13 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.95x+1 B (smoothed 0.94x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens>
I230918 14:44:28.197995 712 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n1,s1] 678  IO overload: compaction score 0.050 (4 ssts, 1 sub-levels), L0 growth 69 MiB (write 69 MiB ingest 0 B ignored 0 B): requests 576219 (1064 bypassed) with 73 MiB acc-write (13 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.93x+1 B (smoothed 0.94x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens>
I230918 14:44:42.018733 7681206 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n1] 679  automatically executing ‹"CREATE STATISTICS __auto__ FROM [12] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:44:42.025829 7681206 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n1] 680  automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:44:42.032732 7681206 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n1] 681  automatically executing ‹"CREATE STATISTICS __auto__ FROM [28] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:44:42.540521 1108228 github.com/google/pprof/internal/driver/webui.go:140 ⋮ [pprof] 682  ‹Serving web UI on http://localhost:0›
I230918 14:45:12.205022 160 util/admission/work_queue.go:538 ⋮ [T1,Vsystem,n1] 683  ‹kv›: FIFO threshold for tenant 1 is -128
I230918 14:45:12.305242 162 util/admission/work_queue.go:538 ⋮ [T1,Vsystem,n1] 684  ‹sql-kv-response›: FIFO threshold for tenant 1 is -128
I230918 14:45:13.198773 712 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n1,s1] 685  IO overload: compaction score 0.050 (7 ssts, 1 sub-levels), L0 growth 69 MiB (write 69 MiB ingest 0 B ignored 0 B): requests 532123 (975 bypassed) with 66 MiB acc-write (12 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 1.03x+1 B (smoothed 0.97x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens >
I230918 14:45:13.305106 708 util/admission/work_queue.go:538 ⋮ [T1,Vsystem,n1] 686  ‹kv›: FIFO threshold for tenant 1 is -128
I230918 14:45:28.199037 712 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n1,s1] 687  IO overload: compaction score 0.050 (3 ssts, 1 sub-levels), L0 growth 55 MiB (write 55 MiB ingest 0 B ignored 0 B): requests 509127 (907 bypassed) with 65 MiB acc-write (11 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.85x+1 B (smoothed 0.91x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens >
W230918 14:45:32.850835 2907 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n1,rnode=3,raddr=‹10.142.1.51:26257›,class=default,rpc] 688  latency jump (prev avg 5.47ms, current 50.98ms)
I230918 14:45:42.039594 8199117 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n1] 689  automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:45:42.181663 8199117 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n1] 690  automatically executing ‹"CREATE STATISTICS __auto__ FROM [28] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:45:42.188788 8199117 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n1] 691  automatically executing ‹"CREATE STATISTICS __auto__ FROM [12] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:45:43.199292 712 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n1,s1] 692  IO overload: compaction score 0.050 (4 ssts, 1 sub-levels), L0 growth 55 MiB (write 55 MiB ingest 0 B ignored 0 B): requests 467800 (869 bypassed) with 59 MiB acc-write (10 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.93x+1 B (smoothed 0.92x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens >

I230918 14:42:41.390736 6986719 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n2] 743  automatically executing ‹"CREATE STATISTICS __auto__ FROM [42] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:42:41.398380 6986719 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n2] 744  automatically executing ‹"CREATE STATISTICS __auto__ FROM [43] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:42:41.405037 6986719 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n2] 745  automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:43:41.410856 7575515 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n2] 746  automatically executing ‹"CREATE STATISTICS __auto__ FROM [43] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:43:41.418226 7575515 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n2] 747  automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:43:41.424449 7575515 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n2] 748  automatically executing ‹"CREATE STATISTICS __auto__ FROM [42] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:44:00.965468 574 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n2,s2] 749  IO overload: compaction score 0.050 (3 ssts, 1 sub-levels), L0 growth 69 MiB (write 69 MiB ingest 0 B ignored 0 B): requests 580746 (1085 bypassed) with 73 MiB acc-write (2.0 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.93x+1 B (smoothed 0.93x+1 B) + ingested-model 0.00x+0 B (smoothed 0.09x+1 B) + at-admission-token>
I230918 14:44:15.965752 574 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n2,s2] 750  IO overload: compaction score 0.050 (4 ssts, 1 sub-levels), L0 growth 69 MiB (write 69 MiB ingest 0 B ignored 0 B): requests 576379 (1084 bypassed) with 72 MiB acc-write (1.6 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.94x+1 B (smoothed 0.94x+1 B) + ingested-model 0.00x+0 B (smoothed 0.09x+1 B) + at-admission-token>
I230918 14:44:30.965923 574 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n2,s2] 751  IO overload: compaction score 0.050 (4 ssts, 1 sub-levels), L0 growth 69 MiB (write 69 MiB ingest 0 B ignored 0 B): requests 576335 (1084 bypassed) with 73 MiB acc-write (2.0 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.94x+1 B (smoothed 0.94x+1 B) + ingested-model 0.00x+0 B (smoothed 0.09x+1 B) + at-admission-token>
I230918 14:44:41.430658 8151876 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n2] 752  automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:44:41.438960 8151876 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n2] 753  automatically executing ‹"CREATE STATISTICS __auto__ FROM [43] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:44:41.446140 8151876 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n2] 754  automatically executing ‹"CREATE STATISTICS __auto__ FROM [42] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:45:16.005109 570 util/admission/work_queue.go:538 ⋮ [T1,Vsystem,n2] 755  ‹kv›: FIFO threshold for tenant 1 is -128
I230918 14:45:16.006315 136 util/admission/work_queue.go:538 ⋮ [T1,Vsystem,n2] 756  ‹sql-kv-response›: FIFO threshold for tenant 1 is -128
I230918 14:45:16.105008 134 util/admission/work_queue.go:538 ⋮ [T1,Vsystem,n2] 757  ‹kv›: FIFO threshold for tenant 1 is -128
I230918 14:45:30.966327 574 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n2,s2] 758  IO overload: compaction score 0.050 (4 ssts, 1 sub-levels), L0 growth 69 MiB (write 69 MiB ingest 0 B ignored 0 B): requests 498902 (1166 bypassed) with 63 MiB acc-write (2.0 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 1.08x+1 B (smoothed 0.98x+1 B) + ingested-model 0.00x+0 B (smoothed 0.09x+1 B) + at-admission-token>
I230918 14:45:41.454709 8703920 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n2] 759  automatically executing ‹"CREATE STATISTICS __auto__ FROM [43] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:45:41.502119 8703920 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n2] 760  automatically executing ‹"CREATE STATISTICS __auto__ FROM [42] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:45:41.508785 8703920 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n2] 761  automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:45:45.966610 574 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n2,s2] 762  IO overload: compaction score 0.050 (4 ssts, 1 sub-levels), L0 growth 56 MiB (write 56 MiB ingest 0 B ignored 0 B): requests 468003 (884 bypassed) with 59 MiB acc-write (1.6 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.93x+1 B (smoothed 0.95x+1 B) + ingested-model 0.00x+0 B (smoothed 0.09x+1 B) + at-admission-tokens>

I230918 14:41:24.726717 5513799 sql/stats/automatic_stats.go:899 ⋮ [T1,Vsystem,n3] 592  automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230918 14:41:24.777125 5778998 jobs/registry.go:1619 ⋮ [T1,Vsystem,n3] 593  AUTO CREATE STATS job 901046877327687683: stepping through state running
I230918 14:41:46.422231 5967924 sql/catalog/lease/descriptor_state.go:142 ⋮ [n3] 594  new lease: tabledesc.immutable: {ID: 42, Version: 2, ModificationTime: "1695048106.360770375,0", ParentID: 1, ParentSchemaID: 29, State: PUBLIC, NextColumnID: 20, Columns: [{ID: 1, TypeID: 1184, Null: false}, {ID: 2, TypeID: 17, Null: false}, {ID: 3, TypeID: 17, Null: false}, {ID: 4, TypeID: 17, Null: false}, {ID: 5, TypeID: 25, Null: fals>
I230918 14:41:46.574375 5968672 sql/catalog/lease/descriptor_state.go:142 ⋮ [n3] 595  new lease: tabledesc.immutable: {ID: 43, Version: 2, ModificationTime: "1695048106.522471789,0", ParentID: 1, ParentSchemaID: 29, State: PUBLIC, NextColumnID: 15, Columns: [{ID: 1, TypeID: 1184, Null: false}, {ID: 2, TypeID: 17, Null: false}, {ID: 3, TypeID: 25, Null: false}, {ID: 4, TypeID: 20, Null: false}, {ID: 5, TypeID: 1186, Null: fa>
I230918 14:41:46.716111 5969951 sql/catalog/lease/descriptor_state.go:142 ⋮ [n3] 596  new lease: tabledesc.immutable: {ID: 60, Version: 2, ModificationTime: "1695048106.667439900,0", ParentID: 1, ParentSchemaID: 29, State: PUBLIC, NextColumnID: 18, Columns: [{ID: 1, TypeID: 1184, Null: false}, {ID: 2, TypeID: 17, Null: false}, {ID: 3, TypeID: 17, Null: false}, {ID: 4, TypeID: 17, Null: false}, {ID: 5, TypeID: 25, Null: fals>
I230918 14:41:46.864631 5971651 sql/catalog/lease/descriptor_state.go:142 ⋮ [n3] 597  new lease: tabledesc.immutable: {ID: 61, Version: 2, ModificationTime: "1695048106.819520697,0", ParentID: 1, ParentSchemaID: 29, State: PUBLIC, NextColumnID: 15, Columns: [{ID: 1, TypeID: 1184, Null: false}, {ID: 2, TypeID: 17, Null: false}, {ID: 3, TypeID: 25, Null: false}, {ID: 4, TypeID: 1186, Null: false}, {ID: 5, TypeID: 3802, Null: >
I230918 14:43:47.739432 623 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n3,s3] 598  IO overload: compaction score 0.050 (4 ssts, 1 sub-levels), L0 growth 69 MiB (write 69 MiB ingest 0 B ignored 0 B): requests 582253 (1151 bypassed) with 74 MiB acc-write (2.0 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.93x+1 B (smoothed 0.92x+1 B) + ingested-model 0.00x+0 B (smoothed 0.09x+1 B) + at-admission-token>
I230918 14:44:02.739702 623 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n3,s3] 599  IO overload: compaction score 0.100 (8 ssts, 2 sub-levels), L0 growth 69 MiB (write 69 MiB ingest 0 B ignored 0 B): requests 579644 (1151 bypassed) with 73 MiB acc-write (1.6 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.94x+1 B (smoothed 0.93x+1 B) + ingested-model 0.00x+0 B (smoothed 0.09x+1 B) + at-admission-token>
I230918 14:44:47.740216 623 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n3,s3] 600  IO overload: compaction score 0.050 (6 ssts, 1 sub-levels), L0 growth 69 MiB (write 69 MiB ingest 0 B ignored 0 B): requests 569285 (1138 bypassed) with 72 MiB acc-write (2.0 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.95x+1 B (smoothed 0.92x+1 B) + ingested-model 0.00x+0 B (smoothed 0.09x+1 B) + at-admission-token>
I230918 14:45:02.740479 623 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n3,s3] 601  IO overload: compaction score 0.050 (7 ssts, 1 sub-levels), L0 growth 69 MiB (write 69 MiB ingest 0 B ignored 0 B): requests 544684 (1309 bypassed) with 68 MiB acc-write (1.6 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 1.00x+1 B (smoothed 0.96x+1 B) + ingested-model 0.00x+0 B (smoothed 0.09x+1 B) + at-admission-token>
I230918 14:45:17.805992 169 util/admission/work_queue.go:538 ⋮ [T1,Vsystem,n3] 602  ‹kv›: FIFO threshold for tenant 1 is -128
I230918 14:45:17.806025 171 util/admission/work_queue.go:538 ⋮ [T1,Vsystem,n3] 603  ‹sql-kv-response›: FIFO threshold for tenant 1 is -128
I230918 14:45:17.905048 619 util/admission/work_queue.go:538 ⋮ [T1,Vsystem,n3] 604  ‹kv›: FIFO threshold for tenant 1 is -128
I230918 14:45:32.740953 623 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n3,s3] 605  IO overload: compaction score 0.050 (3 ssts, 1 sub-levels), L0 growth 55 MiB (write 55 MiB ingest 0 B ignored 0 B): requests 499659 (1240 bypassed) with 63 MiB acc-write (1.6 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.87x+1 B (smoothed 0.88x+1 B) + ingested-model 0.00x+0 B (smoothed 0.09x+1 B) + at-admission-token>
I230918 14:45:47.742058 623 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n3,s3] 606  IO overload: compaction score 0.050 (3 ssts, 1 sub-levels), L0 growth 56 MiB (write 56 MiB ingest 0 B ignored 0 B): requests 459336 (947 bypassed) with 58 MiB acc-write (2.0 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.94x+1 B (smoothed 0.91x+1 B) + ingested-model 0.00x+0 B (smoothed 0.09x+1 B) + at-admission-tokens>
I230918 14:46:02.743555 623 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n3,s3] 607  IO overload: compaction score 0.050 (4 ssts, 1 sub-levels), L0 growth 55 MiB (write 55 MiB ingest 0 B ignored 0 B): requests 433408 (871 bypassed) with 55 MiB acc-write (1.6 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 1.00x+1 B (smoothed 0.96x+1 B) + ingested-model 0.00x+0 B (smoothed 0.09x+1 B) + at-admission-tokens>
I230918 14:47:47.746423 623 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n3,s3] 608  IO overload: compaction score 0.050 (4 ssts, 1 sub-levels), L0 growth 42 MiB (write 42 MiB ingest 0 B ignored 0 B): requests 337652 (1387 bypassed) with 43 MiB acc-write (2.1 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.95x+1 B (smoothed 0.96x+1 B) + ingested-model 0.00x+0 B (smoothed 0.09x+1 B) + at-admission-token>
I230918 14:48:00.515193 657 server/server_systemlog_gc.go:211 ⋮ [T1,Vsystem,n3] 609  garbage collected 0 rows from ‹rangelog›.‹timestamp›
I230918 14:48:00.519075 657 server/server_systemlog_gc.go:211 ⋮ [T1,Vsystem,n3] 610  garbage collected 0 rows from ‹eventlog›.‹timestamp›
I230918 14:48:00.523388 657 server/server_systemlog_gc.go:211 ⋮ [T1,Vsystem,n3] 611  garbage collected 0 rows from ‹web_sessions›.‹expiresAt›
I230918 14:48:00.526730 657 server/server_systemlog_gc.go:211 ⋮ [T1,Vsystem,n3] 612  garbage collected 0 rows from ‹web_sessions›.‹revokedAt›
I230918 14:48:02.747584 623 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n3,s3] 613  IO overload: compaction score 0.050 (7 ssts, 1 sub-levels), L0 growth 42 MiB (write 42 MiB ingest 0 B ignored 0 B): requests 323931 (771 bypassed) with 42 MiB acc-write (1.8 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.99x+1 B (smoothed 0.98x+1 B) + ingested-model 0.00x+0 B (smoothed 0.09x+1 B) + at-admission-tokens>
I230918 14:48:19.686829 754 sql/sqlstats/persistedsqlstats/flush.go:72 ⋮ [T1,Vsystem,n3] 614  flushing 28 stmt/txn fingerprints (184320 bytes) after 1.563µs
I230918 14:48:19.687102 754 sql/sqlstats/persistedsqlstats/flush.go:113 ⋮ [T1,Vsystem,n3] 615  PersistedSQLStats.StmtsLimitSizeReached skipped with last check at: 2023-09-18 14:39:29.200003963 +0000 UTC m=+551.929349641 and check interval: 1h0m0s
W230918 14:48:31.996490 1700 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n3,rnode=2,raddr=‹10.142.1.120:26257›,class=default,rpc] 616  latency jump (prev avg 3.71ms, current 79.86ms)
I230918 14:48:40.454198 5778998 jobs/registry.go:1619 ⋮ [T1,Vsystem,n3] 617  AUTO CREATE STATS job 901046877327687683: stepping through state succeeded

So there was a stats collection of the workload table (106) that ran from 14:41 to 14:48. But that doesn't really line up with the timing very well either.

Write IOPs appear to increase somewhat at around the right time. That seems interesting, considering the SQL ops/s drops. Do you know what the provisioned IOPs are for these machines? It looks like it reaches around 10-11k after the change.

Good eye, Drew. The GCP metrics also show the increase in IOPs: https://console.cloud.google.com/compute/instancesDetail/zones/us-east1-b/instances/treilly-32-0001?project=cockroach-ephemeral&tab=monitoring&pageState=(%22observabilityTab%22:(%22mainContent%22:%22metrics%22,%22section%22:%22diskPerformance%22),%22duration%22:(%22groupValue%22:%22P1D%22,%22customValue%22:null))

That being said, the limit should be 90k write IOPs per local NVMe SSD, and these machines have 4 disks each: https://cloud.google.com/compute/docs/disks/local-ssd#nvme (we're only getting up to about 1k or 2k write IOPs per disk)

michae2 commented 1 year ago

Okay, getting somewhere: in order to repro the slow behavior, it's only necessary to enable optimizer histograms on the create stats job internal query launched here:

https://github.com/cockroachdb/cockroach/blob/release-23.1/pkg/sql/create_stats.go#L592

So I guess telling the optimizer to use histogram stats on the queries executed while we are creating table statistics causes the malfunction. But to repeat the findings above: there's nothing in the cpu/memory/mutex profiles that indicates what's going on; enabling the optimizer histograms just magically makes the latency on some of the KV requests go way up.

I'll keep zooming in...

@cucaroach if you could narrow this down to one specific stats collection query out of the ones listed above, that would help.

cucaroach commented 1 year ago

Oh, sorry, I left out a very important part of the investigation. I wasn't able to nail this behavior down to a particular query/table. The bit about create_stats.go being involved was wrong; I just wasn't waiting long enough for the collapse to happen. After that I extended the test to 60 minutes to make sure I was/wasn't seeing the behavior.

What I did was create an override for turning off optimizer histograms at the query execution site, disable the queries executed right before the collapse, and then re-run. The behavior happens right after a query runs with histograms enabled, and it doesn't seem to matter much which query it was. I repeated this process ~30 times and the behavior would still happen; I literally had to disable almost all queries to get the collapse to stop. This led to a lot of time spent trying to understand what kind of knock-on effect enabling histograms could have; from what I could tell, it really is just a matter of copying the histogram slice out of the stats cache and possibly generating some different query plans.

Hopefully by focusing on the queries that generate different plans it will become clearer what's going on.

michae2 commented 1 year ago

I see, interesting. Thank you for digging into this so deeply. I can see why this has proved so elusive.

erikgrinaker commented 1 year ago

doubling COCKROACH_SCHEDULER_SHARD_SIZE to 32

I wonder if this is because it increases Raft scheduler mutex contention, which alleviates pressure on some pathologically contended resource elsewhere in the system. Did Raft scheduler latency increase or decrease with this change?

srosenberg commented 1 year ago

This is quite a puzzle! Admission control kicks in at 14:45 (see KV Admission Slots); shortly thereafter, concurrency (as measured by runnable goroutines) drops.

Any insights from the goroutine profiles? Also, was the workload running on n4? I couldn't find any system metrics (node_exporter) for it or the cluster nodes. We should enable node_exporter if only to rule out anything lower in the stack.

cucaroach commented 1 year ago

I wonder if this is because it increases Raft scheduler mutex contention, which alleviates pressure on some pathologically contended resource elsewhere in the system. Did Raft scheduler latency increase or decrease with this change?

Raft scheduler latency is high when performance is degraded and low when it isn't. With histograms disabled, P99 is around 25ms; when they are enabled, it's more like >300ms. COCKROACH_SCHEDULER_SHARD_SIZE doesn't seem to change that.

Here are the time bounds for the 32-shard experiment I did this morning:

http://treilly-32-0001.roachprod.crdb.io:26258/#/metrics/overview/cluster?start=1695132979&end=1695134779

DrewKimball commented 1 year ago

@cucaroach If we want to make sure that this regression is due to plan changes and not something more exotic like mutex contention, maybe we could try running a build that reads/builds the histogram, but otherwise ignores it when calculating stats.

mgartner commented 1 year ago

@cucaroach If we want to make sure that this regression is due to plan changes and not something more exotic like mutex contention, maybe we could try running a build that reads/builds the histogram, but otherwise ignores it when calculating stats.

My understanding is that this is exactly what optimizer_use_histograms is controlling. When it is off, we are still collecting histograms for system tables, but we're not considering those histograms when calculating memo statistics.
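
For reference, the session-level knob is the optimizer_use_histograms session variable, which can be toggled per session; it only affects whether the optimizer consults existing histograms, not whether they get collected:

-- Stop the optimizer from using histograms for statements in this session.
SET optimizer_use_histograms = false;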

It's not immediately clear from looking at where that setting is queried whether it would prove or disprove query plan changes versus some exotic mutex contention as the culprit here, though.

cucaroach commented 1 year ago

I don't think mutex contention is the issue, but I could be wrong. I see roughly the same amount of mutex contention in the good and bad cases, but it is squarely centered around RWMutexes, with a lot of it related to the ReplicaMutex. Here's the good case:

[screenshot, 2023-09-19 3:46 PM: mutex profile delta, good case]

Here's the bad case:

[screenshot, 2023-09-19 3:46 PM: mutex profile delta, bad case]

I feel like with 32 CPUs and 64 client sessions, 3-4s of contention isn't terrible; these are 3s mutex profile deltas. Certainly an experiment with DRWMutex is warranted, but I'm focused on examining plans right now; the changing plans must change the KV request demographics somehow. As an experiment to gauge whether this amount of contention was problematic, I completely commented out the statsCollector Start/EndTransaction calls and it made no difference.

mgartner commented 1 year ago

One thing that's suspicious in the trace from the slow statement bundle here is that there's a significant amount of time spent between steps that usually take <1ms in the statement bundles I've seen (maybe I'm just not very worldly w.r.t. traces?):

[screenshot: trace from the slow statement bundle, showing a ~30ms gap between steps]

Is the ~30ms spent here a clue?

There's still a big unexplained gap in the trace; maybe it makes sense to insert additional tracing logs and try to pin down where the majority of that time is spent?

srosenberg commented 1 year ago

Is the ~30ms spent here a clue?

I'd say the first ~7ms is suspect. It suggests the request is being held in some queue(s).

DrewKimball commented 1 year ago

I'd say the first ~7ms is suspect. It suggests the request is being held in some queue(s).

Admission control is the obvious suspect there. It happens right after the first line.

cucaroach commented 1 year ago

I tried turning off admission control because I suspected it was letting in too much work; when we're in degraded mode, disabling admission control actually seems to make things worse.

One weird thing about this is that if you turn the histogram disabling for internal queries back on, performance doesn't return to normal; we're stuck in degraded mode. But I discovered that if I call reset_sql_stats and reset_activity_tables, the throughput collapse goes away. So there's some bad interaction between how those tables work and histograms. TRUNCATE on the statement_statistics and transaction_statistics tables does nothing, so it must be that the in-memory cache for these things is the issue (the reset functions also reset the in-memory state).
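
Concretely, the resets I'm referring to are the builtin functions, invoked along these lines (assuming reset_activity_tables lives under crdb_internal alongside reset_sql_stats):

-- Clear persisted and in-memory SQL stats, then the activity tables.
SELECT crdb_internal.reset_sql_stats();
SELECT crdb_internal.reset_activity_tables();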

So the next thing I tried was removing this lock: https://github.com/cockroachdb/cockroach/blob/release-23.1/pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go#L359

Not positive that's completely sound, but it seems to remove over half of the throughput collapse; we bottom out around 30k instead of 20k. Progress!

[screenshot, 2023-09-21 10:02 AM: throughput with the lock removed]

Making all those heap allocations while holding a lock is silly, but I need to draw a line between the histogram setting change and the cache performance; I suspect the cardinality of the maps goes way up. Interestingly, the raft replication P99 is cut in half with this change; I still have no idea how that's connected to these frontend happenings.

cucaroach commented 1 year ago

Not positive that's completely sound, but it seems to remove over half of the throughput collapse; we bottom out around 30k instead of 20k. Progress!

This result didn't replicate: of 3 runs, only the first looked like this; the last two fell into the <20k QPS hole.

srosenberg commented 1 year ago

It's probably been considered already... what about just logging EXPLAIN (OPT) for every issued query in both runs? We'd expect no material difference, but what if there is one?

cucaroach commented 1 year ago

By saving off the statement_statistics between runs and doing some dogfooding I was able to narrow it down:

https://docs.google.com/spreadsheets/d/16M8MeaLlWr1UWyGlkn9IXa5fUC8g9Bfy9QVfsCWZ5TI

By disabling optimizer histogram usage for just the first 10 queries, I was able to get the good performance. Now I need to analyze what goes wrong with these queries/plans and narrow it down further.

michae2 commented 1 year ago

[triage] Our plan is to leave histograms disabled for internal queries in 23.2 and pick this back up during 24.1 development.

michae2 commented 9 months ago

[queries collab session] Based on Tommy's analysis, one idea is to use a table-level setting to disable histograms for system.statement_statistics, system.transaction_statistics, and system.statement_activity.

mgartner commented 3 months ago

The most likely explanation is that this makes some internal query plans worse, but we were unable to confirm that last time we looked at it.

This is currently preventing histograms from being used by internal executor queries. However, there's not a strong motivation to enable them—we're not aware of any query plans that would improve with the usage of histograms—so we're going to backlog this for now.