MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com
Other
5.71k stars 466 forks source link

0dt upgrade: thread 'tokio:work-19' panicked at src/storage-controller/src/persist_handles.rs:304:17: cannot register [...] at 1720813373649 because txns is at 1720813373828 #28216

Closed def- closed 1 month ago

def- commented 1 month ago

What version of Materialize are you using?

4335764dd2 (Pull Request #28191)

What is the issue?

Seen in https://github.com/MaterializeInc/materialize/pull/28191's nightly run: https://buildkite.com/materialize/nightly/builds/8526#0190a865-7db5-45dd-bf6c-80591609bacb:

thread 'tokio:work-19' panicked at /var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-021e3b57519a1f97a-1/materialize/test/src/storage-controller/src/persist_handles.rs:304:17:
cannot register [System(444), System(445), System(446), System(447), System(448), System(449), System(450), System(451), System(452), System(453), System(454), System(455), System(456), System(457), System(458), System(459), System(460), System(461), System(462), System(463), System(464), System(465), System(466), System(467), System(468), System(469), System(470), System(471), System(472), System(473), System(474), System(476), System(477), System(478), System(479), System(480), System(481), System(482), System(483), System(484), System(485), System(486), System(487), System(488), System(489), System(490), System(491), System(492), System(493), System(494), System(495), System(496), System(711), System(1372), User(23), User(27), User(28), User(32), User(36), User(54), User(55), User(56), User(57), User(58), User(62), User(63), User(64), User(66), User(67), User(68), User(69), User(71), User(73), User(83), User(84), User(98), User(99), User(124), User(128), User(130), User(132), User(133), User(137), User(139), User(140), User(150), User(156), User(163), User(170), User(178), User(179), User(180), User(185), User(186), User(187), User(188), User(200), User(211), User(221), User(225), User(227), User(228), User(229), User(233), User(234), User(238), User(241), User(243), User(244), User(245), User(246), User(247), User(249), User(261), User(262), User(265), User(272), User(273), User(274), User(278), User(285), User(286), User(288), User(289), User(290), User(300), User(303), User(304), User(311), User(313), User(325), User(331), User(332), User(334), User(336)] at 1720813373649 because txns is at 1720813373828
stack backtrace:
   0: rust_begin_unwind
             at ./rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at ./rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
   2: <mz_storage_controller::persist_handles::TxnsTableWorker<mz_repr::timestamp::Timestamp>>::register::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-021e3b57519a1f97a-1/materialize/test/src/storage-controller/src/persist_handles.rs:304:17
   3: <tracing::instrument::Instrumented<<mz_storage_controller::persist_handles::TxnsTableWorker<mz_repr::timestamp::Timestamp>>::register::{closure#0}> as core::future::future::Future>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
   4: <mz_storage_controller::persist_handles::TxnsTableWorker<mz_repr::timestamp::Timestamp>>::run::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-021e3b57519a1f97a-1/materialize/test/src/storage-controller/src/persist_handles.rs:231:26
   5: <mz_storage_controller::persist_handles::PersistTableWriteWorker<mz_repr::timestamp::Timestamp>>::new_txns::{closure#1}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-021e3b57519a1f97a-1/materialize/test/src/storage-controller/src/persist_handles.rs:145:28
   6: <tracing::instrument::Instrumented<<mz_storage_controller::persist_handles::PersistTableWriteWorker<mz_repr::timestamp::Timestamp>>::new_txns::{closure#1}> as core::future::future::Future>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
   7: <tokio::runtime::task::core::Core<tracing::instrument::Instrumented<<mz_storage_controller::persist_handles::PersistTableWriteWorker<mz_repr::timestamp::Timestamp>>::new_txns::{closure#1}>, alloc::sync::Arc<tokio::runtime::scheduler::current_thread::Handle>>>::poll::{closure#0}
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/task/core.rs:328:17
   8: <tokio::loom::std::unsafe_cell::UnsafeCell<tokio::runtime::task::core::Stage<tracing::instrument::Instrumented<<mz_storage_controller::persist_handles::PersistTableWriteWorker<mz_repr::timestamp::Timestamp>>::new_txns::{closure#1}>>>>::with_mut::<core::task::poll::Poll<()>, <tokio::runtime::task::core::Core<tracing::instrument::Instrumented<<mz_storage_controller::persist_handles::PersistTableWriteWorker<mz_repr::timestamp::Timestamp>>::new_txns::{closure#1}>, alloc::sync::Arc<tokio::runtime::scheduler::current_thread::Handle>>>::poll::{closure#0}>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/loom/std/unsafe_cell.rs:16:9
   9: <tokio::runtime::task::core::Core<tracing::instrument::Instrumented<<mz_storage_controller::persist_handles::PersistTableWriteWorker<mz_repr::timestamp::Timestamp>>::new_txns::{closure#1}>, alloc::sync::Arc<tokio::runtime::scheduler::current_thread::Handle>>>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/task/core.rs:317:30
  10: tokio::runtime::task::harness::poll_future::<tracing::instrument::Instrumented<<mz_storage_controller::persist_handles::PersistTableWriteWorker<mz_repr::timestamp::Timestamp>>::new_txns::{closure#1}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>::{closure#0}
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/task/harness.rs:485:19
  11: <core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future<tracing::instrument::Instrumented<<mz_storage_controller::persist_handles::PersistTableWriteWorker<mz_repr::timestamp::Timestamp>>::new_txns::{closure#1}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>::{closure#0}> as core::ops::function::FnOnce<()>>::call_once
             at ./rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panic/unwind_safe.rs:272:9
  12: std::panicking::try::do_call::<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future<tracing::instrument::Instrumented<<mz_storage_controller::persist_handles::PersistTableWriteWorker<mz_repr::timestamp::Timestamp>>::new_txns::{closure#1}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>::{closure#0}>, core::task::poll::Poll<()>>
             at ./rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:559:40
  13: std::panicking::try::<core::task::poll::Poll<()>, core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future<tracing::instrument::Instrumented<<mz_storage_controller::persist_handles::PersistTableWriteWorker<mz_repr::timestamp::Timestamp>>::new_txns::{closure#1}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>::{closure#0}>>
             at ./rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:523:19
  14: std::panic::catch_unwind::<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future<tracing::instrument::Instrumented<<mz_storage_controller::persist_handles::PersistTableWriteWorker<mz_repr::timestamp::Timestamp>>::new_txns::{closure#1}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>::{closure#0}>, core::task::poll::Poll<()>>
             at ./rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panic.rs:149:14
  15: tokio::runtime::task::harness::poll_future::<tracing::instrument::Instrumented<<mz_storage_controller::persist_handles::PersistTableWriteWorker<mz_repr::timestamp::Timestamp>>::new_txns::{closure#1}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/task/harness.rs:473:18
  16: <tokio::runtime::task::harness::Harness<tracing::instrument::Instrumented<<mz_storage_controller::persist_handles::PersistTableWriteWorker<mz_repr::timestamp::Timestamp>>::new_txns::{closure#1}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>>::poll_inner
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/task/harness.rs:208:27
  17: <tokio::runtime::task::harness::Harness<tracing::instrument::Instrumented<<mz_storage_controller::persist_handles::PersistTableWriteWorker<mz_repr::timestamp::Timestamp>>::new_txns::{closure#1}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/task/harness.rs:153:15
  18: <tokio::runtime::task::raw::RawTask>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/task/raw.rs:201:18
  19: <tokio::runtime::task::LocalNotified<alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>>::run
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/task/mod.rs:427:9
  20: <tokio::runtime::scheduler::multi_thread::worker::Context>::run_task::{closure#0}
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/multi_thread/worker.rs:648:22
  21: tokio::runtime::coop::with_budget::<core::result::Result<alloc::boxed::Box<tokio::runtime::scheduler::multi_thread::worker::Core>, ()>, <tokio::runtime::scheduler::multi_thread::worker::Context>::run_task::{closure#0}>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:107:5
  22: tokio::runtime::coop::budget::<core::result::Result<alloc::boxed::Box<tokio::runtime::scheduler::multi_thread::worker::Core>, ()>, <tokio::runtime::scheduler::multi_thread::worker::Context>::run_task::{closure#0}>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:73:5
  23: <tokio::runtime::scheduler::multi_thread::worker::Context>::run_task
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/multi_thread/worker.rs:584:9
  24: <tokio::runtime::scheduler::multi_thread::worker::Context>::run
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/multi_thread/worker.rs:535:24
  25: tokio::runtime::scheduler::multi_thread::worker::run::{closure#0}::{closure#0}
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/multi_thread/worker.rs:500:21
  26: <tokio::runtime::context::scoped::Scoped<tokio::runtime::scheduler::Context>>::set::<tokio::runtime::scheduler::multi_thread::worker::run::{closure#0}::{closure#0}, ()>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/scoped.rs:40:9
  27: tokio::runtime::scheduler::multi_thread::worker::run::{closure#0}
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/multi_thread/worker.rs:495:9
  28: tokio::runtime::context::runtime::enter_runtime::<tokio::runtime::scheduler::multi_thread::worker::run::{closure#0}, ()>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/runtime.rs:65:16
  29: tokio::runtime::scheduler::multi_thread::worker::run
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/multi_thread/worker.rs:487:5
  30: <tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/multi_thread/worker.rs:455:45
  31: <tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}> as core::future::future::Future>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/blocking/task.rs:42:21
  32: <tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>> as core::future::future::Future>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  33: <tokio::runtime::task::core::Core<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>>::poll::{closure#0}
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/task/core.rs:328:17
  34: <tokio::loom::std::unsafe_cell::UnsafeCell<tokio::runtime::task::core::Stage<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>>>>::with_mut::<core::task::poll::Poll<()>, <tokio::runtime::task::core::Core<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>>::poll::{closure#0}>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/loom/std/unsafe_cell.rs:16:9
  35: <tokio::runtime::task::core::Core<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/task/core.rs:317:30
  36: tokio::runtime::task::harness::poll_future::<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>::{closure#0}
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/task/harness.rs:485:19
  37: <core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>::{closure#0}> as core::ops::function::FnOnce<()>>::call_once
             at ./rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panic/unwind_safe.rs:272:9
  38: std::panicking::try::do_call::<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>::{closure#0}>, core::task::poll::Poll<()>>
             at ./rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:559:40
  39: std::panicking::try::<core::task::poll::Poll<()>, core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>::{closure#0}>>
             at ./rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:523:19
  40: std::panic::catch_unwind::<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>::{closure#0}>, core::task::poll::Poll<()>>
             at ./rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panic.rs:149:14
  41: tokio::runtime::task::harness::poll_future::<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/task/harness.rs:473:18
  42: <tokio::runtime::task::harness::Harness<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>>::poll_inner
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/task/harness.rs:208:27
  43: <tokio::runtime::task::harness::Harness<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/task/harness.rs:153:15
  44: <tokio::runtime::task::raw::RawTask>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/task/raw.rs:201:18
  45: <tokio::runtime::task::UnownedTask<tokio::runtime::blocking::schedule::BlockingSchedule>>::run
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/task/mod.rs:464:9
  46: <tokio::runtime::blocking::pool::Task>::run
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/blocking/pool.rs:159:9
  47: <tokio::runtime::blocking::pool::Inner>::run
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/blocking/pool.rs:513:17
  48: <tokio::runtime::blocking::pool::Spawner>::spawn_thread::{closure#0}
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/blocking/pool.rs:471:13
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

This is also happening a lot now:

platform-checks-mz_2-1              | thread 'timely:work-3' panicked at /var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-033ac7e951f12fbe9-1/materialize/test/src/compute/src/render.rs:235:33: compute_import: u183: s0e5c152b-7eed-4b4f-a25a-66e037e3b4bb cannot serve requested as_of Antichain { elements: [1720813341001] }: Since(Antichain { elements: [1720813350000] })

@aljoscha Should we keep tracking that as part of https://github.com/MaterializeInc/materialize/issues/27345, or is it a new issue since it happens reliably during 0dt upgrades?

ci-regexp: cannot register .* because txns is at

benesch commented 1 month ago

Possibly fixed by https://github.com/MaterializeInc/materialize/pull/28324. 🤞🏽

def- commented 1 month ago

I'm still seeing this panic even with #28324 : https://buildkite.com/materialize/nightly/builds/8628

benesch commented 1 month ago

I'm still seeing this panic even with #28324 : https://buildkite.com/materialize/nightly/builds/8628

Oops. Turns out that PR made things quite a bit worse due to the way it interleaved with builtin table writes. Pushed up an alternate implementation that hopefully fixes both the issue introduced by the PR and this issue.

benesch commented 1 month ago

@def- how's this nightly run looking? https://buildkite.com/materialize/nightly/builds/8630#_

def- commented 1 month ago

Looking good. Let me try rerunning with #28324 again: https://buildkite.com/materialize/nightly/builds/8632

def- commented 1 month ago

This stopped occurring just after your PR landed @benesch , perfect: https://ci-failures.dev.materialize.com/?key=test-failures&tfFilters=%5B%7B%22id%22%3A%22issue%22%2C%22value%22%3A%22materialize%2F28216%22%7D%5D