MaterializeInc / materialize


Latency slowly creeps up when running same few SELECTs against a cluster #29526

Open def- opened 5 days ago

def- commented 5 days ago

What version of Materialize are you using?

v0.116.0

What is the issue?

Seen in https://buildkite.com/materialize/qa-canary/builds/228#0191eb33-4b0c-406a-af5e-1e5bf13c4413 on my PR introducing this test: https://github.com/MaterializeInc/materialize/pull/29524. The test runs a few simple SELECT queries against a cluster. Only the SELECT 1 query is open loop at 100 queries per second (and is not affected); the rest are closed loop (and strict serializable), and they slowly get slower over time:

Statistics for SELECT 1 (pooled):
  queries: 60000
  qps:  100.00
  min:    2.02ms
  avg:   11.51ms
  p50:    4.21ms
  p95:   36.24ms
  p99:   79.82ms
  max:  496.96ms
  std:   19.93ms
  slope: -0.0129
Statistics for SELECT * FROM qa_canary_environment.public_table.table_mv (reuse connection):
  queries:  1713
  qps:    2.86
  min:   52.22ms
  avg:  350.28ms
  p50:  338.93ms
  p95:  482.84ms
  p99:  596.52ms
  max:  869.84ms
  std:   81.92ms
  slope: 0.0533
Statistics for SELECT COUNT(DISTINCT a_name) FROM qa_canary_environment.public_mysql_cdc.mysql_wmr WHERE degree > 1 (reuse connection):
  queries:  2013
  qps:    3.36
  min:   83.28ms
  avg:  298.07ms
  p50:  312.89ms
  p95:  360.71ms
  p99:  403.32ms
  max:  635.58ms
  std:   58.38ms
  slope: 0.0464
Statistics for SELECT min(c), max(c), count(*) FROM qa_canary_environment.public_table.table (reuse connection):
  queries:  1133
  qps:    1.89
  min:   95.89ms
  avg:  529.60ms
  p50:  538.24ms
  p95:  692.57ms
  p99:  787.96ms
  max:  923.38ms
  std:  124.02ms
  slope: 0.1532
Statistics for SELECT COUNT(DISTINCT a_name) FROM qa_canary_environment.public_pg_cdc.wmr WHERE degree > 1 (reuse connection):
  queries:  2020
  qps:    3.37
  min:   99.21ms
  avg:  297.04ms
  p50:  311.60ms
  p95:  357.89ms
  p99:  383.76ms
  max:  623.20ms
  std:   54.57ms
  slope: 0.0574
Statistics for SELECT COUNT(DISTINCT count_star) FROM qa_canary_environment.public_loadgen.sales_product_product_category WHERE count_distinct_product_id > 0 (reuse connection):
  queries:  2759
  qps:    4.60
  min:   12.42ms
  avg:  217.48ms
  p50:  224.08ms
  p95:  328.81ms
  p99:  352.85ms
  max:  497.71ms
  std:   91.04ms
  slope: 0.0305
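
(For context on how to read the slope numbers: the real harness is in the PR linked above; below is only a minimal sketch, with made-up names, of how a closed-loop driver and a per-query latency slope could be computed. I fit against query index here; the real test may fit against wall-clock time. A positive slope means the fitted latency trend increases as more queries are issued.)

import time
from typing import Callable, List

def closed_loop(run_query: Callable[[], None], duration_s: float) -> List[float]:
    """Issue queries back to back for duration_s seconds; return per-query latencies in ms."""
    latencies: List[float] = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        start = time.monotonic()
        run_query()  # e.g. cur.execute("SELECT ..."); cur.fetchall()
        latencies.append((time.monotonic() - start) * 1000.0)
    return latencies

def slope(latencies: List[float]) -> float:
    """Ordinary least-squares slope of latency (ms) against query index."""
    n = len(latencies)
    if n < 2:
        return 0.0
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(latencies) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, latencies))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den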

The workload runs against the Materialize Production Sandbox (maybe I should move it to a dedicated staging environment to better isolate it from other noise?). Since the first attempt ran for only 10 minutes, I'm now retrying with 1 hour: https://buildkite.com/materialize/qa-canary/builds/229. The cluster itself (200cc, https://console.materialize.com/regions/aws-us-east-1/clusters/u3/qa_canary_environment_compute?timePeriod=180) always stayed at <=50% CPU usage. Since it is not overloaded, I expected query performance to stay consistent over time.

def- commented 5 days ago

One thing that came to mind: @antiguru mentioned that the query history currently doesn't cope with running many queries against Materialize (it eventually OOMs). Maybe that's related? I'm not even able to load the query history anymore.

def- commented 5 days ago

The mz_catalog_server usage keeps going up, so I guess that is related and fits with the query history being responsible:

[screenshots: mz_catalog_server CPU and memory usage climbing, 2024-09-13 15:48 and 15:49]

For OLTP-type workloads this is probably a blocker (and it also blocks me from enabling this test).
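
If the statement log is indeed what's growing, a rough way to watch it during a run might be to count the rows accumulating in the statement execution history. This is only a sketch: the table name mz_internal.mz_statement_execution_history, the connection string, and the polling interval are assumptions on my part, so check the statement-logging docs before relying on it.

import time
import psycopg  # assumption: psycopg 3 client ("pip install psycopg[binary]")

# Hypothetical connection string; point it at the environment under test.
with psycopg.connect("postgresql://user@host:6875/materialize") as conn:
    with conn.cursor() as cur:
        for _ in range(10):
            # Assumed catalog table populated by the statement-logging feature.
            cur.execute("SELECT count(*) FROM mz_internal.mz_statement_execution_history")
            print("statement history rows:", cur.fetchone()[0])
            time.sleep(60)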

antiguru commented 5 days ago

Could we run this against a staging environment that has statement logging disabled? This way we could unblock the test, which is important to have on its own.
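
For reference, a sketch of what "statement logging disabled" could look like, assuming the statement_logging_default_sample_rate / statement_logging_max_sample_rate system parameters from the statement-logging feature and a connection that is allowed to run ALTER SYSTEM (e.g. as mz_system in a staging environment); parameter names and connection details are assumptions, not a verified recipe:

import psycopg  # assumption: psycopg 3 client

# Hypothetical connection string; ALTER SYSTEM needs a privileged user such as mz_system.
with psycopg.connect("postgresql://mz_system@host:6875/materialize", autocommit=True) as conn:
    with conn.cursor() as cur:
        # Setting both sample-rate parameters to 0 should stop new statements
        # from being logged (verify the parameter names against the docs).
        cur.execute("ALTER SYSTEM SET statement_logging_default_sample_rate = 0")
        cur.execute("ALTER SYSTEM SET statement_logging_max_sample_rate = 0")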

def- commented 5 days ago

Could we run this against a staging environment that has statement logging disabled? This way we could unblock the test, which is important to have on its own.

Yes, that will be my next step. I was hoping not to have to do this so early, because of the cost and the inconvenience of having to recreate the sources, etc.

benesch commented 4 days ago

I was hoping not to have to do this so early, because of the cost and the inconvenience of having to recreate the sources, etc.

What environment is the test running against now? We could disable statement logging against that environment instead.

benesch commented 4 days ago

Oh, just saw the conversation on Slack. Never mind.