def- opened this issue 1 month ago
We can use CI Failures to see when this occurs again with more context. So far nothing.
Aha! We have one: https://buildkite.com/materialize/nightly/builds/9115#01915669-0a3f-438b-be63-2bc0b7b1b8bd
The relevant log snippets from that one are:
parallel-workload-materialized-1 | environmentd: 2024-08-15T15:07:25.976061Z INFO coord::handle_message{kind="purified_statement_ready"}:sequence_create_source:coord::catalog_transact_with_side_effects:coord::catalog_transact_inner:catalog::transact:catalog::transact_inner:transact_op: mz_adapter::catalog::transact: create source db-pw-1723733000-0.s-14.kafka_table57 (u319)
parallel-workload-materialized2-1 | environmentd: 2024-08-15T15:07:33.881371Z INFO environmentd::run:environmentd::serve:adapter::serve:coord::coordinator: mz_adapter::catalog::open: startup: controller init: beginning
parallel-workload-materialized2-1 | environmentd: 2024-08-15T15:07:33.886867Z INFO environmentd::run:environmentd::serve:adapter::serve:coord::coordinator:controller::new: mz_controller: starting controllers in read-only mode!
parallel-workload-materialized-1 | environmentd: 2024-08-15T15:07:35.907644Z INFO coord::handle_message{kind="command-execute"}:message_command:coord::handle_execute{session="3136aea1-fc34-4a14-8076-1cd95dc84fd1"}:coord::handle_execute_inner{stmt="DROP SOURCE \"db-pw-1723733000-0\".\"s-14\".kafka_table57"}:sequence_drop_objects:coord::catalog_transact:coord::catalog_transact_conn:coord::catalog_transact_inner:catalog::transact:catalog::transact_inner:transact_op: mz_adapter::catalog::transact: drop source db-pw-1723733000-0.s-14.kafka_table57 (u319)
parallel-workload-materialized-1 | environmentd: 2024-08-15T15:07:36.167253Z INFO mz_storage_client::storage_collections: removing collection state because the since advanced to []! id=u319
parallel-workload-materialized-1 | environmentd: 2024-08-15T15:07:36.176158Z INFO mz_storage_client::storage_collections: removing persist handles because the since advanced to []! id=u319
parallel-workload-materialized-1 | environmentd: 2024-08-15T15:07:36.176173Z INFO mz_storage_client::storage_collections: enqueing shard finalization due to dropped collection and dropped persist handle id=u319 dropped_shard_id=s0e63c9cf-f200-48f0-9a77-5541f63a175f
parallel-workload-materialized2-1 | environmentd: 2024-08-15T15:07:37.477783Z INFO mz_service::grpc: GrpcClient /tmp/3964e4004e68513c22fd04b478d92a227a0a66b5: connected
parallel-workload-materialized2-1 | thread 'coordinator' panicked at /var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-085e762fefce6d863-1/materialize/nightly/src/storage-controller/src/lib.rs:699:17:
parallel-workload-materialized2-1 | dependency since has advanced past dependent (u319) upper
parallel-workload-materialized2-1 |
parallel-workload-materialized2-1 | dependent (u319): upper Antichain { elements: [1723734456001] }
parallel-workload-materialized2-1 |
parallel-workload-materialized2-1 | dependency since Antichain { elements: [] }
parallel-workload-materialized2-1 |
parallel-workload-materialized2-1 | dependency read holds: [ReadHold { id: User(319), since: Antichain { elements: [] }, .. }, ReadHold { id: User(318), since: Antichain { elements: [0] }, .. }]
edit: See diagnosis below!
So it looks like that shard/collection has been dropped and is in the process of being finalized, but we then also try to bootstrap it in the storage controller. In the controller we can see from the read holds that the since is already the empty antichain, so we can't read from it anymore.
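To make the failing check concrete: as far as I can tell it boils down to a frontier comparison. A minimal sketch (simplified types, not the actual storage-controller code) with the values from the panic message above:

```rust
type Timestamp = u64;

// `a <= b` for frontiers: every element of `b` is at or beyond some element of
// `a`. The empty frontier is the maximum, so `[]` is only <= another `[]`.
fn frontier_le(a: &[Timestamp], b: &[Timestamp]) -> bool {
    b.iter().all(|t| a.iter().any(|s| s <= t))
}

fn main() {
    let dependent_upper: Vec<Timestamp> = vec![1723734456001]; // u319's upper
    let dependency_since: Vec<Timestamp> = vec![];             // finalized shard

    // Roughly what the assertion in the storage controller panics on: the
    // dependency's since must not have advanced past the dependent's upper.
    assert!(
        frontier_le(&dependency_since, &dependent_upper),
        "dependency since has advanced past dependent (u319) upper"
    );
}
```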
I'll try and get to the bottom of how this can happen, but I still think it's not an issue for now.
I believe I have this diagnosed! What happens is this:
- `materialized` is the past deploy generation
- `materialized2` is the generation that we're upgrading to

I updated the log snippets above to basically lay out precisely what happens:

1. `materialized` creates the (kafka) source u319
2. `materialized2` comes up in read-only mode, reads the most recent catalog snapshot as of that time, starts controllers
3. `materialized` drops the source u319, initiates shard finalization which advances the since to `[]`
4. `materialized2` keeps its (now-outdated) catalog state, which still contains u319, and tries to initialize its storage controller state, then notices that the sinces are no good for the upper of that source -> panic (see the sketch below)
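As referenced in step 4, here is a small hypothetical sketch of that interleaving (all names and types are illustrative, not Materialize's actual APIs); the new generation bootstraps from a catalog snapshot taken before the drop, so it still sees u319 while the shared persist state already reports an empty since:

```rust
use std::collections::BTreeMap;

type Timestamp = u64;
// A frontier as its minimal elements; the empty frontier means "advanced past
// everything", i.e. the shard has been finalized and nothing is readable.
type Frontier = Vec<Timestamp>;

struct CatalogEntry {
    id: u64,
    upper: Frontier,
}

// Stand-in for shared persist state: the current since of each shard.
struct SharedPersist {
    sinces: BTreeMap<u64, Frontier>,
}

// Old generation (`materialized`): dropping the source finalizes its shard,
// advancing the since to the empty frontier.
fn drop_source(persist: &mut SharedPersist, id: u64) {
    persist.sinces.insert(id, vec![]);
}

// New generation (`materialized2`), read-only mode: bootstraps the storage
// controller from its (stale) catalog snapshot.
fn bootstrap(stale_catalog: &[CatalogEntry], persist: &SharedPersist) {
    for entry in stale_catalog {
        let since = &persist.sinces[&entry.id];
        // Simplified version of the failing check: a finalized shard (empty
        // since) can never satisfy a dependent with a non-empty upper.
        if since.is_empty() && !entry.upper.is_empty() {
            panic!(
                "dependency since has advanced past dependent (u{}) upper",
                entry.id
            );
        }
    }
}

fn main() {
    // materialized2 takes its catalog snapshot while u319 still exists.
    let stale_catalog = vec![CatalogEntry { id: 319, upper: vec![1723734456001] }];
    let mut persist = SharedPersist { sinces: BTreeMap::from([(319, vec![0])]) };

    drop_source(&mut persist, 319);      // materialized drops u319, since -> []
    bootstrap(&stale_catalog, &persist); // materialized2 panics here
}
```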
Thoughts:

- One could argue that we shouldn't advance the `since` to `[]` while a collection is still needed according to catalog state. I'd say even if we did that it wouldn't be a problem: it is legal to have a collection with since=`[]`; it just means no one can read from it.
- This ☝️, however, might lead to panics in other places. For example, a materialized view that builds on this source might now realize that the sinces of its dependencies (including that Kafka source, say) are no good anymore. But again, that materialized view would already have been dropped in the latest catalog state; otherwise its (and its dependencies') sinces would not have been allowed to go to empty.
Is this something we need to solve up front, or are you proposing that we downgrade the assert and then see if anything like this happens before trying to address it? (Because I'm very in favor of this.)
That's my preferred approach, yes!
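For reference, "downgrade the assert" could be as simple as logging and skipping instead of panicking; a sketch under the assumption that this is acceptable during read-only bootstrap (function and field names here are made up, not the actual code around storage-controller/src/lib.rs:699):

```rust
use tracing::warn;

// Hypothetical stand-in for the check at the panic site: instead of asserting,
// log the inconsistency and let the caller skip the (already dropped) collection.
fn dependency_since_ok(id: u64, dependency_since: &[u64], dependent_upper: &[u64]) -> bool {
    // Same frontier comparison as in the earlier sketch.
    let le = dependent_upper
        .iter()
        .all(|t| dependency_since.iter().any(|s| s <= t));
    if !le {
        // Previously: panic!("dependency since has advanced past dependent ... upper")
        warn!(
            %id,
            ?dependency_since,
            ?dependent_upper,
            "dependency since has advanced past dependent upper; \
             collection was likely dropped concurrently, skipping"
        );
    }
    le
}
```

Whether the right reaction on a `false` return is to skip the collection or to register it in some tombstoned state is a design choice this sketch doesn't settle.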
What version of Materialize are you using?
736303ff07 (Pull Request #28873)
What is the issue?
Serious version of https://github.com/MaterializeInc/materialize/issues/28634 in Parallel Workload (0dt deploy)
ci-regexp: dependency since has advanced past dependent