MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com

thread 'timely:work-2' panicked at src/storage/src/source/kafka.rs:436:22: called `Result::unwrap()` on an `Err` value: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" } #28290

Open def- opened 1 month ago

def- commented 1 month ago

What version of Materialize are you using?

a8b6855bfe34

What is the issue?

Seen in https://buildkite.com/materialize/test/builds/86029#0190bc34-ec0e-4145-bce8-9caeac72a3f9 during v0.108.4 release qualification:

thread 'timely:work-2' panicked at /var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-0b424d0e16dcf7305-1/materialize/test/src/storage/src/source/kafka.rs:436:22:
called `Result::unwrap()` on an `Err` value: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }

The code in question has not been touched in a year (this release was forked off before https://github.com/MaterializeInc/materialize/pull/28226 went into main).

ci-regexp: panicked at src/storage/src/source/kafka.rs:.*: called `Result::unwrap()` on an `Err` value

This seems to be a flake.
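
For reference on how this message can come about: if the unwrap at kafka.rs:436 is on a thread-spawn result (an assumption on my part, but one that fits the thread-count observations below), then pthread_create failing with EAGAIN (os error 11) is reported by the Rust standard library as an io::Error of kind WouldBlock, so unwrapping that Result panics with exactly the message above. A minimal sketch of that failure mode, with a hypothetical thread name:

use std::io;
use std::thread;

fn main() -> io::Result<()> {
    // Hypothetical stand-in for a spawn like the one at kafka.rs:436.
    match thread::Builder::new()
        .name("kafka-metadata".into())
        .spawn(|| { /* metadata polling loop would live here */ })
    {
        Ok(handle) => handle.join().expect("metadata thread panicked"),
        Err(e) if e.kind() == io::ErrorKind::WouldBlock => {
            // The process hit a thread/resource limit; retrying or failing
            // the source cleanly would avoid taking down the whole worker.
            eprintln!("could not spawn metadata thread: {e}");
        }
        Err(e) => return Err(e),
    }
    Ok(())
}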

def- commented 1 month ago

I tried the Zippy Debezium Postgres scenario locally with

bin/mzcompose --find zippy down && bin/mzcompose --find zippy run default --scenario=DebeziumPostgres --actions=1000000

and recorded thread dumps of the storage cluster with gdb:

https://gist.github.com/def-/c4a22d0b521611b45adf77e381a4a138
https://gist.github.com/def-/e95e20dc6df03e1aec893c532d748e80

$ grep "^Thread" threads1 | sed -e "s/.*\"\(.*\)\".*/\1/" | sort | uniq -c | sort -n
   1 persist:0000
   1 persist:0001
   1 persist:0002
   1 persist:0003
   1 persist:0004
   1 persist:0005
   1 persist:0006
   1 persist:0007
   1 persist:0008
   1 persist:0009
   1 persist:000a
   1 persist:000b
   1 persist:000c
   1 persist:000d
   1 persist:000e
   1 persist:000f
   4 jemalloc_bg_thd
   4 rocksdb:high
   5 timely:work-0
   5 timely:work-1
   6 timely:work-2
   6 timely:work-3
  12 kafka-metadata
  12 rdk:broker0
  12 rdk:main
  16 rocksdb:low
  17 clusterd
  19 tokio-runtime-w
  24 rdk:broker-1

Then a bit later:

   1 persist:0000
   1 persist:0001
   1 persist:0002
   1 persist:0003
   1 persist:0004
   1 persist:0005
   1 persist:0006
   1 persist:0007
   1 persist:0008
   1 persist:0009
   1 persist:000a
   1 persist:000b
   1 persist:000c
   1 persist:000d
   1 persist:000e
   1 persist:000f
   4 jemalloc_bg_thd
   4 rocksdb:high
   7 timely:work-0
   7 timely:work-1
   8 timely:work-2
   8 timely:work-3
  16 rocksdb:low
  17 clusterd
  20 kafka-metadata
  20 rdk:main
  22 rdk:broker0
  29 tokio-runtime-w
  38 rdk:broker-1

I'm not sure whether this is enough of a signal, but the number of rdk:broker* and tokio-runtime-w threads seems to increase with time/load. Is that expected?
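
A lighter-weight way to track these counts over time, without attaching gdb, would be to sample the thread names from /proc. A Linux-only sketch (the clusterd pid is passed in by hand; kernel thread names are truncated to 15 characters, which matches the names in the dumps):

use std::collections::BTreeMap;
use std::fs;

fn main() -> std::io::Result<()> {
    // Hypothetical helper: pass the clusterd pid as the first argument.
    let pid: u32 = std::env::args()
        .nth(1)
        .expect("usage: count-threads <pid>")
        .parse()
        .expect("pid must be a number");
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    for entry in fs::read_dir(format!("/proc/{pid}/task"))? {
        // Each task directory has a `comm` file holding the thread name.
        if let Ok(name) = fs::read_to_string(entry?.path().join("comm")) {
            *counts.entry(name.trim().to_string()).or_default() += 1;
        }
    }
    // Per-name counts, comparable to the grep | uniq -c output above.
    for (name, n) in &counts {
        println!("{n:>4} {name}");
    }
    Ok(())
}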

def- commented 1 month ago

This reminds me of https://github.com/MaterializeInc/incidents-and-escalations/issues/56 where we were also leaking kafka threads.

def- commented 1 month ago

When switching Zippy runs from 48 CPU cores to 16, they keep working but stop throwing these thread panics. This suggests that somewhere in Mz, either our code or a library we use decides the number of threads based on the number of available cores, and can pick too many.
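
To illustrate the pattern I suspect (this is not Materialize's actual code, just a sketch of the mechanism): some component sizes its workers from the host's core count, and each worker brings its own helper threads, so the total scales with cores and can run into the process's thread limit, at which point the next spawn fails with EAGAIN/WouldBlock.

use std::num::NonZeroUsize;
use std::thread;

fn main() {
    // Illustrative only: size a worker pool from the host's core count...
    let workers = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);

    // ...and assume each worker spawns a few helper threads of its own
    // (librdkafka, for instance, runs roughly one thread per broker plus a
    // main thread per consumer, which is what the rdk:* names above are).
    // A 48-core host then ends up with roughly 3x the threads of a 16-core
    // host and can hit the thread limit where the smaller host does not.
    let helpers_per_worker = 3; // hypothetical number
    println!(
        "{workers} workers x {helpers_per_worker} helpers = {} threads",
        workers * helpers_per_worker
    );
}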