Open def- opened 1 month ago
I tried the Zippy Debezium Postgres scenario locally with bin/mzcompose --find zippy down && bin/mzcompose --find zippy run default --scenario=DebeziumPostgres --actions=1000000
and recorded thread dumps in the storage cluster from gdb:
https://gist.github.com/def-/c4a22d0b521611b45adf77e381a4a138
https://gist.github.com/def-/e95e20dc6df03e1aec893c532d748e80
$ grep "^Thread" threads1 | sed -e "s/.*\"\(.*\)\".*/\1/" | sort | uniq -c | sort -n ~
1 persist:0000
1 persist:0001
1 persist:0002
1 persist:0003
1 persist:0004
1 persist:0005
1 persist:0006
1 persist:0007
1 persist:0008
1 persist:0009
1 persist:000a
1 persist:000b
1 persist:000c
1 persist:000d
1 persist:000e
1 persist:000f
4 jemalloc_bg_thd
4 rocksdb:high
5 timely:work-0
5 timely:work-1
6 timely:work-2
6 timely:work-3
12 kafka-metadata
12 rdk:broker0
12 rdk:main
16 rocksdb:low
17 clusterd
19 tokio-runtime-w
24 rdk:broker-1
Then a bit later:
1 persist:0000
1 persist:0001
1 persist:0002
1 persist:0003
1 persist:0004
1 persist:0005
1 persist:0006
1 persist:0007
1 persist:0008
1 persist:0009
1 persist:000a
1 persist:000b
1 persist:000c
1 persist:000d
1 persist:000e
1 persist:000f
4 jemalloc_bg_thd
4 rocksdb:high
7 timely:work-0
7 timely:work-1
8 timely:work-2
8 timely:work-3
16 rocksdb:low
17 clusterd
20 kafka-metadata
20 rdk:main
22 rdk:broker0
29 tokio-runtime-w
38 rdk:broker-1
I'm not sure if this is enough of a signal but the number of rdk:broker* and tokio-runtime-w threads seems to increase with time/load. Is that expected?
This reminds me of https://github.com/MaterializeInc/incidents-and-escalations/issues/56 where we were also leaking kafka threads.
When switching Zippy runs from 48 CPU cores to 16, they still keep working, but stop throwing these thread panics. This to me indicates that somewhere in Mz we or a library we use decide the number of threads based on available cores and can pick too many.
What version of Materialize are you using?
a8b6855bfe34
What is the issue?
Seen in https://buildkite.com/materialize/test/builds/86029#0190bc34-ec0e-4145-bce8-9caeac72a3f9 during v0.108.4 release qualification:
The code in question was not touched in a year (this was forked before https://github.com/MaterializeInc/materialize/pull/28226 went into main).
This seems to be a flake.