MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com

thread 'tokio-runtime-worker' panicked at src/cluster/src/communication.rs:216:32: someone claimed to be us #28997

Open def- opened 1 month ago

def- commented 1 month ago

What version of Materialize are you using?

eddd6923419b (Pull Request #28996)

What is the issue?

Seen in Replica isolation

thread 'tokio-runtime-worker' panicked at /var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-092d578492aa35fdd-1/materialize/test/src/cluster/src/communication.rs:216:32:
someone claimed to be us
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: mz_cluster::communication::initialize_networking::{closure#0}
   3: <mz_cluster::server::ClusterClient<mz_service::client::Partitioned<mz_service::local::LocalClient<mz_storage_client::client::StorageCommand, mz_storage_client::client::StorageResponse, std::thread::Thread>, mz_storage_client::client::StorageCommand, mz_storage_client::client::StorageResponse>, mz_storage::server::Config, mz_storage_client::client::StorageCommand, mz_storage_client::client::StorageResponse> as mz_service::client::GenericClient<mz_storage_client::client::StorageCommand, mz_storage_client::client::StorageResponse>>::send::{closure#0}
   4: <alloc::boxed::Box<dyn mz_storage_client::client::StorageClient> as mz_service::client::GenericClient<mz_storage_client::client::StorageCommand, mz_storage_client::client::StorageResponse>>::send::{closure#0}
   5: <async_stream::async_stream::AsyncStream<core::result::Result<mz_storage_client::client::ProtoStorageResponse, tonic::status::Status>, <mz_service::grpc::GrpcServer<mz_storage::server::serve::{closure#0}>>::forward_bidi_stream<mz_storage_client::client::StorageCommand, mz_storage_client::client::StorageResponse, mz_storage_client::client::ProtoStorageCommand, mz_storage_client::client::ProtoStorageResponse>::{closure#0}::{closure#1}> as futures_core::stream::Stream>::poll_next
   6: <tonic::codec::encode::EncodeBody<tonic::codec::encode::EncodedBytes<tonic::codec::prost::ProstEncoder<mz_storage_client::client::ProtoStorageResponse>, tokio_stream::stream_ext::fuse::Fuse<mz_service::grpc::ResponseStream<mz_storage_client::client::ProtoStorageResponse>>>> as http_body::Body>::poll_frame
   7: <http_body_util::combinators::map_err::MapErr<http_body_util::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>, <tonic::status::Status>::map_error<tonic::status::Status>> as http_body::Body>::poll_frame
   8: <http_body_util::combinators::map_err::MapErr<http_body_util::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>, <axum_core::error::Error>::new<tonic::status::Status>> as http_body::Body>::poll_frame
   9: <http_body_util::combinators::map_err::MapErr<axum_core::body::Body, <tonic::status::Status>::map_error<axum_core::error::Error>> as http_body::Body>::poll_frame
  10: <http_body_util::combinators::map_err::MapErr<http_body_util::combinators::map_err::MapErr<tonic::transport::server::service::recover_error::MaybeEmptyBody<http_body_util::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>>, <tonic::status::Status as core::convert::Into<alloc::boxed::Box<dyn core::error::Error + core::marker::Send + core::marker::Sync>>>::into>, <tonic::status::Status>::map_error<alloc::boxed::Box<dyn core::error::Error + core::marker::Send + core::marker::Sync>>> as http_body::Body>::poll_frame
  11: <hyper::proto::h2::PipeToSendStream<http_body_util::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>> as core::future::future::Future>::poll
  12: <tracing::instrument::Instrumented<hyper::proto::h2::server::H2Stream<hyper_util::service::TowerToHyperServiceFuture<tower::util::map_request::MapRequest<tower::util::boxed_clone::BoxCloneService<http::request::Request<http_body_util::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>>, http::response::Response<http_body_util::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>>, alloc::boxed::Box<dyn core::error::Error + core::marker::Send + core::marker::Sync>>, <tonic::transport::server::Server>::serve_with_shutdown<tonic::service::router::Routes, mz_ore::netio::socket::Listener, core::future::ready::Ready<()>, mz_ore::netio::socket::Stream, std::io::error::Error, http_body_util::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>>::{closure#0}::{closure#3}>, http::request::Request<hyper::body::incoming::Incoming>>, http_body_util::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>>> as core::future::future::Future>::poll
  13: <tokio::runtime::task::core::Core<tracing::instrument::Instrumented<hyper::proto::h2::server::H2Stream<hyper_util::service::TowerToHyperServiceFuture<tower::util::map_request::MapRequest<tower::util::boxed_clone::BoxCloneService<http::request::Request<http_body_util::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>>, http::response::Response<http_body_util::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>>, alloc::boxed::Box<dyn core::error::Error + core::marker::Send + core::marker::Sync>>, <tonic::transport::server::Server>::serve_with_shutdown<tonic::service::router::Routes, mz_ore::netio::socket::Listener, core::future::ready::Ready<()>, mz_ore::netio::socket::Stream, std::io::error::Error, http_body_util::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>>::{closure#0}::{closure#3}>, http::request::Request<hyper::body::incoming::Incoming>>, http_body_util::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>>>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread_alt::handle::Handle>>>::poll
  14: tokio::runtime::task::raw::poll::<tracing::instrument::Instrumented<hyper::proto::h2::server::H2Stream<hyper_util::service::TowerToHyperServiceFuture<tower::util::map_request::MapRequest<tower::util::boxed_clone::BoxCloneService<http::request::Request<http_body_util::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>>, http::response::Response<http_body_util::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>>, alloc::boxed::Box<dyn core::error::Error + core::marker::Send + core::marker::Sync>>, <tonic::transport::server::Server>::serve_with_shutdown<tonic::service::router::Routes, mz_ore::netio::socket::Listener, core::future::ready::Ready<()>, mz_ore::netio::socket::Stream, std::io::error::Error, http_body_util::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>>::{closure#0}::{closure#3}>, http::request::Request<hyper::body::incoming::Incoming>>, http_body_util::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>>>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>
  15: <tokio::runtime::scheduler::multi_thread::worker::Context>::run_task
  16: <tokio::runtime::context::scoped::Scoped<tokio::runtime::scheduler::Context>>::set::<tokio::runtime::scheduler::multi_thread::worker::run::{closure#0}::{closure#0}, ()>
  17: tokio::runtime::context::runtime::enter_runtime::<tokio::runtime::scheduler::multi_thread::worker::run::{closure#0}, ()>
  18: tokio::runtime::scheduler::multi_thread::worker::run
  19: <tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}> as core::future::future::Future>::poll
  20: <tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>> as core::future::future::Future>::poll
  21: <tokio::runtime::task::core::Core<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>>::poll
  22: <tokio::runtime::task::harness::Harness<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>>::poll
  23: <tokio::runtime::blocking::pool::Inner>::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

ci-regexp: someone claimed to be us

def- commented 1 month ago

Seen again in CI. I'll try to repro locally.

Edit: can be reproduced, but took a while:

while true; do bin/mzcompose --find replica-isolation down && bin/mzcompose --find replica-isolation run default || break; done

Now trying with the new Mz restart logic reverted.

Edit2: Still happens, so unrelated to that. Now trying with https://github.com/MaterializeInc/materialize/pull/28380 and the commit just before it merged.

Edit3: Reproduced on #28380 and never on the state before it, but will keep it running for a few more hours.
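Since the failure only shows up after many iterations, it may help to know how many runs passed before the first failure. The reproduction loop in this comment could be wrapped in a small helper along these lines. This is only a sketch: `repeat_until_failure` is a hypothetical name, not something in the repository, and it assumes a POSIX-compatible shell.

```shell
# Hypothetical helper (not part of the Materialize repository):
# run the given command repeatedly until it exits non-zero,
# then report how many runs succeeded before the failure.
repeat_until_failure() {
  runs=0
  while "$@"; do
    runs=$((runs + 1))
  done
  echo "failed after ${runs} successful run(s)"
}
```

With the commands from the loop above, it might be invoked as `repeat_until_failure sh -c 'bin/mzcompose --find replica-isolation down && bin/mzcompose --find replica-isolation run default'`. As in the original loop, a failure of either step stops the repetition.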

def- commented 4 weeks ago

Before #28380 I got this failure instead:

replica-isolation-materialized-1     | environmentd: 2024-08-15T13:36:34.862548Z  INFO mz_compute_client::controller::replica: error connecting to replica, retrying in 1s: transport error: dns error: failed to lookup address information: Temporary failure in name resolution: dns error: failed to lookup address information: Temporary failure in name resolution: failed to lookup address information: Temporary failure in name resolution replica=User(2)

Apparently clusterd had a crash of sorts:

replica-isolation-clusterd_1_2-1     | 2024-08-15T13:34:56.769911Z  WARN mz_timely_util::panic: halting process: timely communication error: reading data: Connection reset by peer (os error 104)

So something was weird there too, but never this panic.

Attachment: services.log

teskje commented 4 weeks ago

That looks like https://github.com/MaterializeInc/materialize/issues/28046, one of the two issues that #28380 was intended to fix.