MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com
Other
5.72k stars 466 forks source link

parallel-workload: 0dt: thread 'coordinator' panicked at src/storage-controller/src/lib.rs:703:17: dependency since has advanced past dependent (u417) upper #28885

Open def- opened 1 month ago

def- commented 1 month ago

What version of Materialize are you using?

736303ff07 (Pull Request #28873)

What is the issue?

Serious version of https://github.com/MaterializeInc/materialize/issues/28634 in Parallel Workload (0dt deploy)

thread 'coordinator' panicked at /var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/storage-controller/src/lib.rs:703:17:
dependency since has advanced past dependent (u417) upper 

                            dependent (u417): upper Antichain { elements: [1723129772001] } 

                            dependency since Antichain { elements: [] }
stack backtrace:
   0: rust_begin_unwind
             at ./rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at ./rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/core/src/panicking.rs:72:14
   2: <mz_storage_controller::Controller<mz_repr::timestamp::Timestamp> as mz_storage_client::controller::StorageController>::create_collections_for_bootstrap::{closure#0}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/storage-controller/src/lib.rs:703:17
   3: <tracing::instrument::Instrumented<<mz_storage_controller::Controller<mz_repr::timestamp::Timestamp> as mz_storage_client::controller::StorageController>::create_collections_for_bootstrap::{closure#0}::{closure#0}> as core::future::future::Future>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
   4: <mz_storage_controller::Controller<mz_repr::timestamp::Timestamp> as mz_storage_client::controller::StorageController>::create_collections_for_bootstrap::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/storage-controller/src/lib.rs:491:5
   5: <core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = core::result::Result<(), mz_storage_types::controller::StorageError<mz_repr::timestamp::Timestamp>>>>> as core::future::future::Future>::poll
             at ./rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/core/src/future/future.rs:123:9
   6: <mz_adapter::coord::Coordinator>::bootstrap_storage_collections::{closure#0}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:2449:14
   7: <tracing::instrument::Instrumented<<mz_adapter::coord::Coordinator>::bootstrap_storage_collections::{closure#0}::{closure#0}> as core::future::future::Future>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
   8: <mz_adapter::coord::Coordinator>::bootstrap_storage_collections::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:2346:5
   9: <mz_adapter::coord::Coordinator>::bootstrap::{closure#0}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:1824:14
  10: <tracing::instrument::Instrumented<<mz_adapter::coord::Coordinator>::bootstrap::{closure#0}::{closure#0}> as core::future::future::Future>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  11: <mz_adapter::coord::Coordinator>::bootstrap::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:1727:5
  12: mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:3728:26
  13: <tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}> as core::future::future::Future>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  14: <tokio::runtime::park::CachedParkThread>::block_on::<tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>>::{closure#0}
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/park.rs:281:63
  15: tokio::runtime::coop::with_budget::<core::task::poll::Poll<core::result::Result<(), mz_adapter::error::AdapterError>>, <tokio::runtime::park::CachedParkThread>::block_on<tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>>::{closure#0}>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:107:5
  16: tokio::runtime::coop::budget::<core::task::poll::Poll<core::result::Result<(), mz_adapter::error::AdapterError>>, <tokio::runtime::park::CachedParkThread>::block_on<tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>>::{closure#0}>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:73:5
  17: <tokio::runtime::park::CachedParkThread>::block_on::<tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/park.rs:281:31
  18: <tokio::runtime::context::blocking::BlockingRegionGuard>::block_on::<tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/blocking.rs:66:9
  19: <tokio::runtime::handle::Handle>::block_on::<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>::{closure#0}
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/handle.rs:310:22
  20: tokio::runtime::context::runtime::enter_runtime::<<tokio::runtime::handle::Handle>::block_on<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>::{closure#0}, core::result::Result<(), mz_adapter::error::AdapterError>>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/runtime.rs:65:16
  21: <tokio::runtime::handle::Handle>::block_on::<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/handle.rs:309:9
  22: mz_adapter::coord::serve::{closure#0}::{closure#2}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:3721:33
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

ci-regexp: dependency since has advanced past dependent

def- commented 1 month ago

We can use CI Failures to see when this occurs again with more context. So far nothing.

aljoscha commented 4 weeks ago

Aha! we have one: https://buildkite.com/materialize/nightly/builds/9115#01915669-0a3f-438b-be63-2bc0b7b1b8bd

The relevant log snippets from that one are:

parallel-workload-materialized-1     | environmentd: 2024-08-15T15:07:25.976061Z  INFO coord::handle_message{kind="purified_statement_ready"}:sequence_create_source:coord::catalog_transact_with_side_effects:coord::catalog_transact_inner:catalog::transact:catalog::transact_inner:transact_op: mz_adapter::catalog::transact: create source db-pw-1723733000-0.s-14.kafka_table57 (u319)

parallel-workload-materialized2-1    | environmentd: 2024-08-15T15:07:33.881371Z  INFO environmentd::run:environmentd::serve:adapter::serve:coord::coordinator: mz_adapter::catalog::open: startup: controller init: beginning
parallel-workload-materialized2-1    | environmentd: 2024-08-15T15:07:33.886867Z  INFO environmentd::run:environmentd::serve:adapter::serve:coord::coordinator:controller::new: mz_controller: starting controllers in read-only mode!

parallel-workload-materialized-1     | environmentd: 2024-08-15T15:07:35.907644Z  INFO coord::handle_message{kind="command-execute"}:message_command:coord::handle_execute{session="3136aea1-fc34-4a14-8076-1cd95dc84fd1"}:coord::handle_execute_inner{stmt="DROP SOURCE \"db-pw-1723733000-0\".\"s-14\".kafka_table57"}:sequence_drop_objects:coord::catalog_transact:coord::catalog_transact_conn:coord::catalog_transact_inner:catalog::transact:catalog::transact_inner:transact_op: mz_adapter::catalog::transact: drop source db-pw-1723733000-0.s-14.kafka_table57 (u319)
parallel-workload-materialized-1     | environmentd: 2024-08-15T15:07:36.167253Z  INFO mz_storage_client::storage_collections: removing collection state because the since advanced to []! id=u319
parallel-workload-materialized-1     | environmentd: 2024-08-15T15:07:36.176158Z  INFO mz_storage_client::storage_collections: removing persist handles because the since advanced to []! id=u319
parallel-workload-materialized-1     | environmentd: 2024-08-15T15:07:36.176173Z  INFO mz_storage_client::storage_collections: enqueing shard finalization due to dropped collection and dropped persist handle id=u319 dropped_shard_id=s0e63c9cf-f200-48f0-9a77-5541f63a175f

parallel-workload-materialized2-1    | environmentd: 2024-08-15T15:07:37.477783Z  INFO mz_service::grpc: GrpcClient /tmp/3964e4004e68513c22fd04b478d92a227a0a66b5: connected
parallel-workload-materialized2-1    | thread 'coordinator' panicked at /var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-085e762fefce6d863-1/materialize/nightly/src/storage-controller/src/lib.rs:699:17:
parallel-workload-materialized2-1    | dependency since has advanced past dependent (u319) upper 
parallel-workload-materialized2-1    | 
parallel-workload-materialized2-1    |                             dependent (u319): upper Antichain { elements: [1723734456001] } 
parallel-workload-materialized2-1    | 
parallel-workload-materialized2-1    |                             dependency since Antichain { elements: [] } 
parallel-workload-materialized2-1    | 
parallel-workload-materialized2-1    |                             dependency read holds: [ReadHold { id: User(319), since: Antichain { elements: [] }, .. }, ReadHold { id: User(318), since: Antichain { elements: [0] }, .. }]

edit: See diagnosis below!

So looks like that shard/collection has been dropped and is in the process of finalization. But then we also try and bootstrap it in the storage controller. In the controller, we can see from the read holds that the since is already the empty antichain, so we couldn't read it anymore.

I'll try and get to the bottom of how this can happen, but I still think it's not an issue for now.

aljoscha commented 4 weeks ago

I believe I have this diagnosed! What happens is this:

I updated the log snippets above to basically lay out precisely what happens:

  1. materialized creates the (kafka) source u319
  2. materialized2 comes up in read-only mode, reads the most recent catalog snapshot as of that time, starts controllers
  3. materialized drops the source u319, initiates shard finalization which advances the since to []
  4. materialized2 keeps its (now-outdated) catalog state, that still contains u319 and tries to initialize its storage controller state, then notices that the sinces are no good for the upper of that source -> panic

Thoughts:

benesch commented 4 weeks ago
  • This ☝️, however, might lead to panics in other places. For example, a materialized view that builds on this source might now realize that the sinces of its dependencies (including that Kafka source, say) are now no good anymore. But again, that materialized view would already have been dropped in the latest catalog state, otherwise its (and its dependencies) sinces would not have been allowed to go to empty.

Is this something we need to solve up front, or are you proposing that we downgrade the assert and then see if anything like this happens before trying to address it?

benesch commented 4 weeks ago

or are you proposing that we downgrade the assert and then see if anything like this happens before trying to address it?

(Because I'm very in favor of this.)

aljoscha commented 3 weeks ago

or are you proposing that we downgrade the assert and then see if anything like this happens before trying to address it?

(Because I'm very in favor of this.)

That's my preferred approach, yes!