parallel-workload: 0dt: thread 'coordinator' panicked at src/storage-controller/src/lib.rs:703:17: dependency since has advanced past dependent (u417) upper

What version of Materialize are you using?

736303ff07 (Pull Request #28873)

What is the issue?

Serious version of https://github.com/MaterializeInc/materialize/issues/28634 in Parallel Workload (0dt deploy)

thread 'coordinator' panicked at /var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/storage-controller/src/lib.rs:703:17:
dependency since has advanced past dependent (u417) upper 

                            dependent (u417): upper Antichain { elements: [1723129772001] } 

                            dependency since Antichain { elements: [] }
stack backtrace:
   0: rust_begin_unwind
             at ./rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at ./rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/core/src/panicking.rs:72:14
   2: <mz_storage_controller::Controller<mz_repr::timestamp::Timestamp> as mz_storage_client::controller::StorageController>::create_collections_for_bootstrap::{closure#0}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/storage-controller/src/lib.rs:703:17
   3: <tracing::instrument::Instrumented<<mz_storage_controller::Controller<mz_repr::timestamp::Timestamp> as mz_storage_client::controller::StorageController>::create_collections_for_bootstrap::{closure#0}::{closure#0}> as core::future::future::Future>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
   4: <mz_storage_controller::Controller<mz_repr::timestamp::Timestamp> as mz_storage_client::controller::StorageController>::create_collections_for_bootstrap::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/storage-controller/src/lib.rs:491:5
   5: <core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = core::result::Result<(), mz_storage_types::controller::StorageError<mz_repr::timestamp::Timestamp>>>>> as core::future::future::Future>::poll
             at ./rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/core/src/future/future.rs:123:9
   6: <mz_adapter::coord::Coordinator>::bootstrap_storage_collections::{closure#0}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:2449:14
   7: <tracing::instrument::Instrumented<<mz_adapter::coord::Coordinator>::bootstrap_storage_collections::{closure#0}::{closure#0}> as core::future::future::Future>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
   8: <mz_adapter::coord::Coordinator>::bootstrap_storage_collections::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:2346:5
   9: <mz_adapter::coord::Coordinator>::bootstrap::{closure#0}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:1824:14
  10: <tracing::instrument::Instrumented<<mz_adapter::coord::Coordinator>::bootstrap::{closure#0}::{closure#0}> as core::future::future::Future>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  11: <mz_adapter::coord::Coordinator>::bootstrap::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:1727:5
  12: mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:3728:26
  13: <tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}> as core::future::future::Future>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  14: <tokio::runtime::park::CachedParkThread>::block_on::<tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>>::{closure#0}
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/park.rs:281:63
  15: tokio::runtime::coop::with_budget::<core::task::poll::Poll<core::result::Result<(), mz_adapter::error::AdapterError>>, <tokio::runtime::park::CachedParkThread>::block_on<tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>>::{closure#0}>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:107:5
  16: tokio::runtime::coop::budget::<core::task::poll::Poll<core::result::Result<(), mz_adapter::error::AdapterError>>, <tokio::runtime::park::CachedParkThread>::block_on<tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>>::{closure#0}>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:73:5
  17: <tokio::runtime::park::CachedParkThread>::block_on::<tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/park.rs:281:31
  18: <tokio::runtime::context::blocking::BlockingRegionGuard>::block_on::<tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/blocking.rs:66:9
  19: <tokio::runtime::handle::Handle>::block_on::<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>::{closure#0}
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/handle.rs:310:22
  20: tokio::runtime::context::runtime::enter_runtime::<<tokio::runtime::handle::Handle>::block_on<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>::{closure#0}, core::result::Result<(), mz_adapter::error::AdapterError>>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/runtime.rs:65:16
  21: <tokio::runtime::handle::Handle>::block_on::<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/handle.rs:309:9
  22: mz_adapter::coord::serve::{closure#0}::{closure#2}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:3721:33
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

ci-regexp: dependency since has advanced past dependent

We can use CI Failures to see when this occurs again with more context. So far nothing.

Aha! we have one: https://buildkite.com/materialize/nightly/builds/9115#01915669-0a3f-438b-be63-2bc0b7b1b8bd

The relevant log snippets from that one are:

parallel-workload-materialized-1     | environmentd: 2024-08-15T15:07:25.976061Z  INFO coord::handle_message{kind="purified_statement_ready"}:sequence_create_source:coord::catalog_transact_with_side_effects:coord::catalog_transact_inner:catalog::transact:catalog::transact_inner:transact_op: mz_adapter::catalog::transact: create source db-pw-1723733000-0.s-14.kafka_table57 (u319)

parallel-workload-materialized2-1    | environmentd: 2024-08-15T15:07:33.881371Z  INFO environmentd::run:environmentd::serve:adapter::serve:coord::coordinator: mz_adapter::catalog::open: startup: controller init: beginning
parallel-workload-materialized2-1    | environmentd: 2024-08-15T15:07:33.886867Z  INFO environmentd::run:environmentd::serve:adapter::serve:coord::coordinator:controller::new: mz_controller: starting controllers in read-only mode!

parallel-workload-materialized-1     | environmentd: 2024-08-15T15:07:35.907644Z  INFO coord::handle_message{kind="command-execute"}:message_command:coord::handle_execute{session="3136aea1-fc34-4a14-8076-1cd95dc84fd1"}:coord::handle_execute_inner{stmt="DROP SOURCE \"db-pw-1723733000-0\".\"s-14\".kafka_table57"}:sequence_drop_objects:coord::catalog_transact:coord::catalog_transact_conn:coord::catalog_transact_inner:catalog::transact:catalog::transact_inner:transact_op: mz_adapter::catalog::transact: drop source db-pw-1723733000-0.s-14.kafka_table57 (u319)
parallel-workload-materialized-1     | environmentd: 2024-08-15T15:07:36.167253Z  INFO mz_storage_client::storage_collections: removing collection state because the since advanced to []! id=u319
parallel-workload-materialized-1     | environmentd: 2024-08-15T15:07:36.176158Z  INFO mz_storage_client::storage_collections: removing persist handles because the since advanced to []! id=u319
parallel-workload-materialized-1     | environmentd: 2024-08-15T15:07:36.176173Z  INFO mz_storage_client::storage_collections: enqueing shard finalization due to dropped collection and dropped persist handle id=u319 dropped_shard_id=s0e63c9cf-f200-48f0-9a77-5541f63a175f

parallel-workload-materialized2-1    | environmentd: 2024-08-15T15:07:37.477783Z  INFO mz_service::grpc: GrpcClient /tmp/3964e4004e68513c22fd04b478d92a227a0a66b5: connected
parallel-workload-materialized2-1    | thread 'coordinator' panicked at /var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-085e762fefce6d863-1/materialize/nightly/src/storage-controller/src/lib.rs:699:17:
parallel-workload-materialized2-1    | dependency since has advanced past dependent (u319) upper 
parallel-workload-materialized2-1    | 
parallel-workload-materialized2-1    |                             dependent (u319): upper Antichain { elements: [1723734456001] } 
parallel-workload-materialized2-1    | 
parallel-workload-materialized2-1    |                             dependency since Antichain { elements: [] } 
parallel-workload-materialized2-1    | 
parallel-workload-materialized2-1    |                             dependency read holds: [ReadHold { id: User(319), since: Antichain { elements: [] }, .. }, ReadHold { id: User(318), since: Antichain { elements: [0] }, .. }]

edit: See diagnosis below!

So looks like that shard/collection has been dropped and is in the process of finalization. But then we also try and bootstrap it in the storage controller. In the controller, we can see from the read holds that the since is already the empty antichain, so we couldn't read it anymore.

I'll try and get to the bottom of how this can happen, but I still think it's not an issue for now.

I believe I have this diagnosed! What happens is this:

materialized is the past deploy generation
materialized2 is the generation that we're upgrading to

I updated the log snippets above to basically lay out precisely what happens:

materialized creates the (kafka) source u319
materialized2 comes up in read-only mode, reads the most recent catalog snapshot as of that time, starts controllers
materialized drops the source u319, initiates shard finalization which advances the since to []
materialized2 keeps its (now-outdated) catalog state, that still contains u319 and tries to initialize its storage controller state, then notices that the sinces are no good for the upper of that source -> panic

Thoughts:

I don't think we can prevent this, as long as we keep state in different places. Here: the catalog shard and frontiers in individual persist shards. We can't update these atomically.
It's not a problem, as long as we don't downgrade the since to [] while a collection is still needed according to catalog state. I'd say even if we did that it wouldn't be a problem: it is legal to have a collection with since=[], it just means no-one can read from it.
I think the solution is to downgrade this assert https://github.com/MaterializeInc/materialize/blob/f5b681abad68572d46b650a11f5196ae3d39eea4/src/storage-controller/src/lib.rs#L699 to not check frontiers when the since is empty. Because no one can observe changes to the shard anymore. And to also not run ingestions for collections whose since is empty.
This ☝️, however, might lead to panics in other places. For example, a materialized view that builds on this source might now realize that the sinces of its dependencies (including that Kafka source, say) are now no good anymore. But again, that materialized view would already have been dropped in the latest catalog state, otherwise its (and its dependencies) sinces would not have been allowed to go to empty.

This ☝️, however, might lead to panics in other places. For example, a materialized view that builds on this source might now realize that the sinces of its dependencies (including that Kafka source, say) are now no good anymore. But again, that materialized view would already have been dropped in the latest catalog state, otherwise its (and its dependencies) sinces would not have been allowed to go to empty.

Is this something we need to solve up front, or are you proposing that we downgrade the assert and then see if anything like this happens before trying to address it?

or are you proposing that we downgrade the assert and then see if anything like this happens before trying to address it?

(Because I'm very in favor of this.)

or are you proposing that we downgrade the assert and then see if anything like this happens before trying to address it?

(Because I'm very in favor of this.)

That's my preferred approach, yes!

MaterializeInc / materialize

parallel-workload: 0dt: thread 'coordinator' panicked at src/storage-controller/src/lib.rs:703:17: dependency since has advanced past dependent (u417) upper #28885

What version of Materialize are you using?

What is the issue?