Closed: def- closed this issue 3 months ago
I've seen a similar segfault in CI. It points to this function:
(gdb) disassemble 0x58d3000
Dump of assembler code for function _RINvNtCskl7yHyHAUN_4core3ptr13drop_in_placeINtNtNtNtCs3t61LHhTfKA_5tokio7runtime4task4core4CellINtNtCs5ucvXM0JFgk_7tracing10instrument12InstrumentedINtNtNtCseBpelh4Gd7n_12futures_util6future6future3MapINtNtB2r_7poll_fn6PollFnNCNCNvMs1_NtNtCsfrN94GYaqgt_5hyper6client6clientINtB3L_6ClientINtCsWA1kU9jSU3_13hyper_timeout16TimeoutConnectorINtCsjkPAOE70uaW_13hyper_openssl14HttpsConnectorNtNtNtB3N_7connect4http13HttpConnectorEEE12send_request0s3_0ENCB3D_s4_0EEINtNtCsf2W9KFidpdm_5alloc4sync3ArcNtNtNtNtBM_9scheduler12multi_thread6handle6HandleEEECskhiGdiyWS6U_11kube_client.llvm.9057245008734995724:
0x00000000058d2fd0 <+0>: push %r14
0x00000000058d2fd2 <+2>: push %rbx
0x00000000058d2fd3 <+3>: push %rax
0x00000000058d2fd4 <+4>: mov %rdi,%rbx
0x00000000058d2fd7 <+7>: mov 0x28(%rdi),%rax
0x00000000058d2fdb <+11>: lock decq (%rax)
0x00000000058d2fdf <+15>: jne 0x58d2fea <_RINvNtCskl7yHyHAUN_4core3ptr13drop_in_placeINtNtNtNtCs3t61LHhTfKA_5tokio7runtime4task4core4CellINtNtCs5ucvXM0JFgk_7tracing10instrument12InstrumentedINtNtNtCseBpelh4Gd7n_12futures_util6future6future3MapINtNtB2r_7poll_fn6PollFnNCNCNvMs1_NtNtCsfrN94GYaqgt_5hyper6client6clientINtB3L_6ClientINtCsWA1kU9jSU3_13hyper_timeout16TimeoutConnectorINtCsjkPAOE70uaW_13hyper_openssl14HttpsConnectorNtNtNtB3N_7connect4http13HttpConnectorEEE12send_request0s3_0ENCB3D_s4_0EEINtNtCsf2W9KFidpdm_5alloc4sync3ArcNtNtNtNtBM_9scheduler12multi_thread6handle6HandleEEECskhiGdiyWS6U_11kube_client.llvm.9057245008734995724+26>
0x00000000058d2fe1 <+17>: lea 0x28(%rbx),%rdi
0x00000000058d2fe5 <+21>: call 0x593b8a0 <_RNvMsf_NtCsf2W9KFidpdm_5alloc4syncINtB5_3ArcNtNtNtNtNtCs3t61LHhTfKA_5tokio7runtime9scheduler12multi_thread6handle6HandleE9drop_slowCskhiGdiyWS6U_11kube_client>
0x00000000058d2fea <+26>: lea 0x38(%rbx),%rdi
0x00000000058d2fee <+30>: call 0x58d39f0 <_RINvNtCskl7yHyHAUN_4core3ptr13drop_in_placeINtNtNtNtCs3t61LHhTfKA_5tokio7runtime4task4core5StageINtNtCs5ucvXM0JFgk_7tracing10instrument12InstrumentedINtNtNtCseBpelh4Gd7n_12futures_util6future6future3MapINtNtB2s_7poll_fn6PollFnNCNCNvMs1_NtNtCsfrN94GYaqgt_5hyper6client6clientINtB3M_6ClientINtCsWA1kU9jSU3_13hyper_timeout16TimeoutConnectorINtCsjkPAOE70uaW_13hyper_openssl14HttpsConnectorNtNtNtB3O_7connect4http13HttpConnectorEEE12send_request0s3_0ENCB3E_s4_0EEEECskhiGdiyWS6U_11kube_client.llvm.9057245008734995724>
0x00000000058d2ff3 <+35>: mov 0xe0(%rbx),%rax
0x00000000058d2ffa <+42>: test %rax,%rax
0x00000000058d2ffd <+45>: je 0x58d3010 <_RINvNtCskl7yHyHAUN_4core3ptr13drop_in_placeINtNtNtNtCs3t61LHhTfKA_5tokio7runtime4task4core4CellINtNtCs5ucvXM0JFgk_7tracing10instrument12InstrumentedINtNtNtCseBpelh4Gd7n_12futures_util6future6future3MapINtNtB2r_7poll_fn6PollFnNCNCNvMs1_NtNtCsfrN94GYaqgt_5hyper6client6clientINtB3L_6ClientINtCsWA1kU9jSU3_13hyper_timeout16TimeoutConnectorINtCsjkPAOE70uaW_13hyper_openssl14HttpsConnectorNtNtNtB3N_7connect4http13HttpConnectorEEE12send_request0s3_0ENCB3D_s4_0EEINtNtCsf2W9KFidpdm_5alloc4sync3ArcNtNtNtNtBM_9scheduler12multi_thread6handle6HandleEEECskhiGdiyWS6U_11kube_client.llvm.9057245008734995724+64>
0x00000000058d2fff <+47>: mov 0xe8(%rbx),%rdi
0x00000000058d3006 <+54>: add $0x8,%rsp
0x00000000058d300a <+58>: pop %rbx
0x00000000058d300b <+59>: pop %r14
0x00000000058d300d <+61>: jmp *0x18(%rax)
0x00000000058d3010 <+64>: add $0x8,%rsp
0x00000000058d3014 <+68>: pop %rbx
0x00000000058d3015 <+69>: pop %r14
0x00000000058d3017 <+71>: ret
0x00000000058d3018 <+72>: mov %rax,%r14
0x00000000058d301b <+75>: lea 0x38(%rbx),%rdi
0x00000000058d301f <+79>: call 0x58d3d80 <_RINvNtCskl7yHyHAUN_4core3ptr13drop_in_placeINtNtNtNtCs3t61LHhTfKA_5tokio7runtime4task4core9CoreStageINtNtCs5ucvXM0JFgk_7tracing10instrument12InstrumentedINtNtNtCseBpelh4Gd7n_12futures_util6future6future3MapINtNtB2w_7poll_fn6PollFnNCNCNvMs1_NtNtCsfrN94GYaqgt_5hyper6client6clientINtB3Q_6ClientINtCsWA1kU9jSU3_13hyper_timeout16TimeoutConnectorINtCsjkPAOE70uaW_13hyper_openssl14HttpsConnectorNtNtNtB3S_7connect4http13HttpConnectorEEE12send_request0s3_0ENCB3I_s4_0EEEECskhiGdiyWS6U_11kube_client.llvm.9057245008734995724>
0x00000000058d3024 <+84>: jmp 0x58d3031 <_RINvNtCskl7yHyHAUN_4core3ptr13drop_in_placeINtNtNtNtCs3t61LHhTfKA_5tokio7runtime4task4core4CellINtNtCs5ucvXM0JFgk_7tracing10instrument12InstrumentedINtNtNtCseBpelh4Gd7n_12futures_util6future6future3MapINtNtB2r_7poll_fn6PollFnNCNCNvMs1_NtNtCsfrN94GYaqgt_5hyper6client6clientINtB3L_6ClientINtCsWA1kU9jSU3_13hyper_timeout16TimeoutConnectorINtCsjkPAOE70uaW_13hyper_openssl14HttpsConnectorNtNtNtB3N_7connect4http13HttpConnectorEEE12send_request0s3_0ENCB3D_s4_0EEINtNtCsf2W9KFidpdm_5alloc4sync3ArcNtNtNtNtBM_9scheduler12multi_thread6handle6HandleEEECskhiGdiyWS6U_11kube_client.llvm.9057245008734995724+97>
0x00000000058d3026 <+86>: call *0x213c734(%rip) # 0x7a0f760
0x00000000058d302c <+92>: ud2
0x00000000058d302e <+94>: mov %rax,%r14
0x00000000058d3031 <+97>: mov 0xe0(%rbx),%rdi
0x00000000058d3038 <+104>: mov 0xe8(%rbx),%rsi
0x00000000058d303f <+111>: call 0x58d64c0 <_RINvNtCskl7yHyHAUN_4core3ptr13drop_in_placeNtNtNtNtCs3t61LHhTfKA_5tokio7runtime4task4core7TrailerECskhiGdiyWS6U_11kube_client.llvm.9057245008734995724>
0x00000000058d3044 <+116>: mov %r14,%rdi
0x00000000058d3047 <+119>: call 0x76d42f0 <_Unwind_Resume@plt>
0x00000000058d304c <+124>: ud2
0x00000000058d304e <+126>: call *0x213c70c(%rip) # 0x7a0f760
0x00000000058d3054 <+132>: ud2
End of assembler dump.
Finally caught one in action: https://buildkite.com/materialize/tests/builds/69416#018bf897-c578-4168-8881-bf6663c705ef
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000556d39cdf8d4 in _rjem_je_prof_free_sampled_object (tsd=tsd@entry=0x7f388c1f9eb0, usize=usize@entry=16384, prof_info=0x7f388c1f8f80) at src/prof.c:158
[Current thread is 1 (LWP 113)]
#0 0x0000556d39cdf8d4 in _rjem_je_prof_free_sampled_object (tsd=tsd@entry=0x7f388c1f9eb0, usize=usize@entry=16384, prof_info=0x7f388c1f8f80) at src/prof.c:158
tctx = 0x7f38bac160e0
szind = 36
#1 0x0000556d39c76cc1 in prof_free (alloc_ctx=<synthetic pointer>, usize=<optimized out>, ptr=0x7f38bac46000, tsd=0x7f388c1f9eb0) at include/jemalloc/internal/prof_inlines.h:257
prof_info = {alloc_time = {ns = 1759010770992}, alloc_tctx = 0x7f38bac160e0, alloc_size = 1040}
prof_info = <optimized out>
#2 ifree (slow_path=false, tcache=0x7f388c1fa208, ptr=0x7f38bac46000, tsd=0x7f388c1f9eb0) at src/jemalloc.c:2883
alloc_ctx = <optimized out>
usize = <optimized out>
alloc_ctx = <optimized out>
usize = <optimized out>
#3 _rjem_je_free_default (ptr=0x7f38bac46000) at src/jemalloc.c:3014
tcache = 0x7f388c1fa208
tsd = 0x7f388c1f9eb0
#4 0x00007f38bb0af5bf in *ABS*+0xa7120@plt () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#5 0x0000556d3a2ece88 in ?? ()
No symbol table info available.
#6 0x0000556d39ae7dc0 in ?? ()
No symbol table info available.
#7 0x0000000000000013 in ?? ()
No symbol table info available.
#8 0x00000001347b0308 in ?? ()
No symbol table info available.
#9 0x00007f388c1f9060 in ?? ()
No symbol table info available.
#10 0x0000000000000002 in ?? ()
No symbol table info available.
#11 0x0000000000000013 in ?? ()
No symbol table info available.
#12 0x00007f388c1f9088 in ?? ()
No symbol table info available.
#13 0x0000556d3a3c2aa8 in std::alloc::HOOK::hf3bb7349d9c03b41 ()
No symbol table info available.
#14 0x00007f388c1f90a0 in ?? ()
No symbol table info available.
#15 0x0000556d3a2eceb8 in mz_timely_util[72eab8da58d9ad79]::panic::halt_on_timely_communication_panic::{closure#0}::CALLSITE::META ()
No symbol table info available.
#16 0x00007f38bb0af610 in *ABS*+0xa48c0@plt () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#17 0x0000000000000001 in ?? ()
No symbol table info available.
#18 0x0000556d39547bf7 in std::sys::unix::os::exit () at library/std/src/sys/unix/os.rs:660
No locals.
#19 0x0000556d39540e4f in std::process::exit () at library/std/src/process.rs:2124
No locals.
#20 0x0000556d39117eb8 in mz_timely_util::panic::halt_on_timely_communication_panic::{closure#0} (panic_info=<optimized out>) at src/timely-util/src/panic.rs:36
e = <optimized out>
e = <optimized out>
old_hook = <optimized out>
#21 0x0000556d39544880 in alloc::boxed::{impl#49}::call<(&core::panic::panic_info::PanicInfo), (dyn core::ops::function::Fn<(&core::panic::panic_info::PanicInfo), Output=()> + core::marker::Send + core::marker::Sync), alloc::alloc::Global> () at library/alloc/src/boxed.rs:1999
No locals.
#22 std::panicking::rust_panic_with_hook () at library/std/src/panicking.rs:709
No locals.
#23 0x0000556d39544624 in std::panicking::begin_panic_handler::{closure#0} () at library/std/src/panicking.rs:597
No locals.
#24 0x0000556d39542836 in std::sys_common::backtrace::__rust_end_short_backtrace<std::panicking::begin_panic_handler::{closure_env#0}, !> () at library/std/src/sys_common/backtrace.rs:151
No locals.
#25 0x0000556d395443c2 in std::panicking::begin_panic_handler () at library/std/src/panicking.rs:593
No locals.
#26 0x0000556d3810eb73 in core::panicking::panic_fmt () at library/core/src/panicking.rs:67
No locals.
#27 0x0000556d395abee3 in timely_communication::allocator::zero_copy::tcp::tcp_panic (context=..., cause=...) at src/allocator/zero_copy/tcp.rs:21
No locals.
#28 0x0000556d357d542e in timely_communication::allocator::zero_copy::tcp::recv_loop<std::net::tcp::TcpStream> (reader=..., targets=..., worker_offset=4, process=1, remote=0, logger=...) at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/io/error/repr_bitpacked.rs:176
active = true
stageds = alloc::vec::Vec<alloc::vec::Vec<timely_bytes::arc::Bytes, alloc::alloc::Global>, alloc::alloc::Global> {buf: alloc::raw_vec::RawVec<alloc::vec::Vec<timely_bytes::arc::Bytes, alloc::alloc::Global>, alloc::alloc::Global> {ptr: core::ptr::unique::Unique<alloc::vec::Vec<timely_bytes::arc::Bytes, alloc::alloc::Global>> {pointer: core::ptr::non_null::NonNull<alloc::vec::Vec<timely_bytes::arc::Bytes, alloc::alloc::Global>> {pointer: 0x7f3888c06060}, _marker: core::marker::PhantomData<alloc::vec::Vec<timely_bytes::arc::Bytes, alloc::alloc::Global>>}, cap: 4, alloc: alloc::alloc::Global}, len: 4}
buffer = timely_communication::allocator::zero_copy::bytes_slab::BytesSlab {buffer: timely_bytes::arc::Bytes {ptr: 0x7f3888c6ccdf, len: 684833, sequestered: alloc::sync::Arc<dyn core::any::Any> {ptr: core::ptr::non_null::NonNull<alloc::sync::ArcInner<dyn core::any::Any>> {pointer: *const alloc::sync::ArcInner<dyn core::any::Any> {pointer: 0x7f3888c00040, vtable: 0x556d3a32d168 <anon.3b7f9802824c1dc73692a97590bcc92f.6.llvm>}}, phantom: core::marker::PhantomData<alloc::sync::ArcInner<dyn core::any::Any>>}}, in_progress: alloc::vec::Vec<core::option::Option<timely_bytes::arc::Bytes>, alloc::alloc::Global> {buf: alloc::raw_vec::RawVec<core::option::Option<timely_bytes::arc::Bytes>, alloc::alloc::Global> {ptr: core::ptr::unique::Unique<core::option::Option<timely_bytes::arc::Bytes>> {pointer: core::ptr::non_null::NonNull<core::option::Option<timely_bytes::arc::Bytes>> {pointer: 0x8}, _marker: core::marker::PhantomData<core::option::Option<timely_bytes::arc::Bytes>>}, cap: 0, alloc: alloc::alloc::Global}, len: 0}, stash: alloc::vec::Vec<timely_bytes::arc::Bytes, alloc::alloc::Global> {buf: alloc::raw_vec::RawVec<timely_bytes::arc::Bytes, alloc::alloc::Global> {ptr: core::ptr::unique::Unique<timely_bytes::arc::Bytes> {pointer: core::ptr::non_null::NonNull<timely_bytes::arc::Bytes> {pointer: 0x8}, _marker: core::marker::PhantomData<timely_bytes::arc::Bytes>}, cap: 0, alloc: alloc::alloc::Global}, len: 0}, shift: 20, valid: 0}
targets = alloc::vec::Vec<timely_communication::allocator::zero_copy::bytes_exchange::MergeQueue, alloc::alloc::Global> {buf: alloc::raw_vec::RawVec<timely_communication::allocator::zero_copy::bytes_exchange::MergeQueue, alloc::alloc::Global> {ptr: core::ptr::unique::Unique<timely_communication::allocator::zero_copy::bytes_exchange::MergeQueue> {pointer: core::ptr::non_null::NonNull<timely_communication::allocator::zero_copy::bytes_exchange::MergeQueue> {pointer: 0x7f3888c06000}, _marker: core::marker::PhantomData<timely_communication::allocator::zero_copy::bytes_exchange::MergeQueue>}, cap: 4, alloc: alloc::alloc::Global}, len: 4}
#29 0x0000556d371b3910 in timely_communication::allocator::zero_copy::initialize::initialize_networking_from_sockets::{closure#2}<std::net::tcp::TcpStream> () at /cargo/git/checkouts/timely-dataflow-70b80d81d6cabd62/9adb32c/communication/src/allocator/zero_copy/initialize.rs:121
log_sender = alloc::sync::Arc<alloc::boxed::Box<(dyn core::ops::function::Fn<(timely_communication::logging::CommunicationSetup), Output=core::option::Option<timely_logging::Logger<timely_communication::logging::CommunicationEvent, timely_communication::logging::CommunicationSetup>>> + core::marker::Send + core::marker::Sync), alloc::alloc::Global>> {ptr: core::ptr::non_null::NonNull<alloc::sync::ArcInner<alloc::boxed::Box<(dyn core::ops::function::Fn<(timely_communication::logging::CommunicationSetup), Output=core::option::Option<timely_logging::Logger<timely_communication::logging::CommunicationEvent, timely_communication::logging::CommunicationSetup>>> + core::marker::Send + core::marker::Sync), alloc::alloc::Global>>> {pointer: 0x7f38aa200100}, phantom: core::marker::PhantomData<alloc::sync::ArcInner<alloc::boxed::Box<(dyn core::ops::function::Fn<(timely_communication::logging::CommunicationSetup), Output=core::option::Option<timely_logging::Logger<timely_communication::logging::CommunicationEvent, timely_communication::logging::CommunicationSetup>>> + core::marker::Send + core::marker::Sync), alloc::alloc::Global>>>}
my_index = 1
index = 0
stream = std::net::tcp::TcpStream (std::sys_common::net::TcpStream {inner: std::sys::unix::net::Socket (std::sys::unix::fd::FileDesc (std::os::fd::owned::OwnedFd {fd: 19}))})
remote_send = alloc::vec::Vec<crossbeam_channel::channel::Receiver<timely_communication::allocator::zero_copy::bytes_exchange::MergeQueue>, alloc::alloc::Global> {buf: alloc::raw_vec::RawVec<crossbeam_channel::channel::Receiver<timely_communication::allocator::zero_copy::bytes_exchange::MergeQueue>, alloc::alloc::Global> {ptr: core::ptr::unique::Unique<crossbeam_channel::channel::Receiver<timely_communication::allocator::zero_copy::bytes_exchange::MergeQueue>> {pointer: core::ptr::non_null::NonNull<crossbeam_channel::channel::Receiver<timely_communication::allocator::zero_copy::bytes_exchange::MergeQueue>> {pointer: 0x7f38a9a4e000}, _marker: core::marker::PhantomData<crossbeam_channel::channel::Receiver<timely_communication::allocator::zero_copy::bytes_exchange::MergeQueue>>}, cap: 4, alloc: alloc::alloc::Global}, len: 4}
threads = 4
logger = <error reading variable logger (Cannot access memory at address 0x0)>
Full gdb backtrace: https://gist.github.com/def-/c8b529ec734c2a42c7a1e279acf594db
This seems to be happening quite often in CI now. I'm not sure if it's the same segfault we originally had in September or a different one, so this might be a release blocker; marking it as one for now until someone can take a look. I'm also not sure whether this should go to storage or compute. It's pretty deep in timely, but clusterd seems to have been busy with Kafka-related sources:
#4 0x0000556d371b5fcd in mz_storage::source::kafka::{impl#0}::render::{closure#0}::{async_block#0}::{closure#4}<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_timely_util::order::Partitioned<i32, mz_storage_types::sources::MzOffset>>, futures_util::stream::stream::Inspect<async_stream::async_stream::AsyncStream<timely::progress::frontier::Antichain<mz_timely_util::order::Partitioned<i32, mz_storage_types::sources::MzOffset>>, mz_storage::source::source_reader_pipeline::reclock_resume_upper::{async_block_env#0}<mz_timely_util::order::Partitioned<i32, mz_storage_types::sources::MzOffset>, mz_repr::timestamp::Timestamp>>, mz_storage::source::source_reader_pipeline::source_render_operator::{closure_env#0}<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_timely_util::order::Partitioned<i32, mz_storage_types::sources::MzOffset>>, mz_storage_types::sources::KafkaSourceConnection<mz_storage_types::connections::inline::InlinedConnection>, async_stream::async_stream::AsyncStream<timely::progress::frontier::Antichain<mz_timely_util::order::Partitioned<i32, mz_storage_types::sources::MzOffset>>, mz_storage::source::source_reader_pipeline::reclock_resume_upper::{async_block_env#0}<mz_timely_util::order::Partitioned<i32, mz_storage_types::sources::MzOffset>, mz_repr::timestamp::Timestamp>>, mz_storage::render::sources::render_source::{async_block_env#0}<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>>>>, mz_storage::render::sources::render_source::{async_block_env#0}<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>>> () at src/storage/src/source/kafka.rs:405
@def- is it thread 1 that segfaulted in that backtrace?
I believe it's saying we're panicking here: https://github.com/MaterializeInc/materialize/blob/main/src/timely-util/src/panic.rs#L36. Perhaps the panic payload string is invalid?
This panic can be caused by an I/O error that happens after we transmute a non-repr(C) type (this one): https://github.com/MaterializeInc/timely-dataflow/blob/master/communication/src/networking.rs#L21
> is it thread 1 that segfaulted in that backtrace?
Yes:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000556d39cdf8d4 in _rjem_je_prof_free_sampled_object (tsd=tsd@entry=0x7f388c1f9eb0, usize=usize@entry=16384, prof_info=0x7f388c1f8f80) at src/prof.c:158
[Current thread is 1 (LWP 113)]
Jemalloc profiling was recently enabled by default, if I recall correctly (the je in the function name refers to jemalloc). So maybe there is a bug in jemalloc's profiling and it needs to be re-disabled?
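For reference, jemalloc's profiling can usually be toggled at startup through its malloc conf; with a prefixed jemalloc build (`_rjem_` symbols, as the frames above suggest) the environment variable is `_RJEM_MALLOC_CONF`. A sketch of the relevant knobs, assuming that build:

```shell
# Disable heap profiling entirely at startup (prefixed jemalloc build).
export _RJEM_MALLOC_CONF="prof:false"

# Or keep profiling compiled in but start inactive, so it can be
# toggled later via the prof.active mallctl:
export _RJEM_MALLOC_CONF="prof:true,prof_active:false"
```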
I tried disabling jemalloc profiling and still saw 2 segfaults: https://buildkite.com/materialize/tests/builds/69555
Grafana shows no segfaults in any of the cloud stacks for the last 30 days (query). That means either I messed up that query, or the regression is a new one, or it's somehow specific to CI.
I kicked off an address-sanitizer run of slt locally to see if it surfaces anything. I will report once it's done.
Segfaults are scary, but it seems they're limited to CI. @def-, do you have a way to reproduce this locally? Possible causes could also include a faulty node or co-locating compute and storage objects (that shouldn't happen, but who knows). Is it possible to run CI with address sanitization?
I ran one of the failing tests locally for a day without being able to reproduce it, so this is still a mystery to me, unfortunately. We don't currently have a way of running CI with ASan.
Since these only seem to happen in CI and not locally, I'm wondering if this is related to Docker weirdness. We might learn more after https://github.com/MaterializeInc/materialize/pull/23496 is merged.
I also tried switching back to Rust 1.71 and it didn't help; I'm still seeing these: https://github.com/MaterializeInc/materialize/pull/23523
@def- We think debugging this further requires asan testing. I think we ran into problems when testing envd under asan, but could you try running only clusterd under asan instead? It doesn't use stacker and thus maybe doesn't introduce the issues we were seeing with envd.
This was seen here in a release qualification run: https://buildkite.com/materialize/release-qualification/builds/419#018d6b13-78ce-4c81-b665-28f970d061f2
Core was generated by `/usr/local/bin/clusterd --storage-controller-listen-addr=/tmp/7c348084873b3776d'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000aaaac07f41d4 in _rjem_je_prof_free_sampled_object (tsd=tsd@entry=0xffff717cb0f8, usize=usize@entry=16384, prof_info=prof_info@entry=0xffff717c98c8) at src/prof.c:158
[Current thread is 1 (LWP 89646)]
#0 0x0000aaaac07f41d4 in _rjem_je_prof_free_sampled_object (tsd=tsd@entry=0xffff717cb0f8, usize=usize@entry=16384, prof_info=prof_info@entry=0xffff717c98c8) at src/prof.c:158
tctx = 0xffffb6c340e0
szind = 36
#1 0x0000aaaac07aed00 in prof_free (alloc_ctx=<synthetic pointer>, usize=16384, ptr=0xffffb6c3b000, tsd=0xffff717cb0f8) at include/jemalloc/internal/prof_inlines.h:257
prof_info = {alloc_time = {ns = 1853139581027}, alloc_tctx = 0xffffb6c340e0, alloc_size = 1040}
prof_info = <optimized out>
#2 ifree (slow_path=false, tcache=0xffff717cb450, ptr=0xffffb6c3b000, tsd=0xffff717cb0f8) at src/jemalloc.c:2883
alloc_ctx = <optimized out>
usize = 16384
alloc_ctx = <optimized out>
usize = <optimized out>
#3 _rjem_je_free_default (ptr=0xffffb6c3b000) at src/jemalloc.c:3014
tcache = 0xffff717cb450
tsd = 0xffff717cb0f8
#4 0x0000ffffb717cecc in __printf_buffer_putc (ch=0 '\000', buf=0xffffb6c3b000) at ../include/printf_buffer.h:148
No locals.
#5 __vstrfmon_l_internal (flags=0, ap=..., format=<optimized out>, loc=<optimized out>, maxsize=<optimized out>, s=<optimized out>) at ./stdlib/strfmon_l.c:575
buf = {write_base = 0xaaaac0171a88 <std::process::exit+20> " ", write_ptr = 0xffff717c996c "\252\252", write_end = 0xaaaabfdbeee8 <mz_timely_util::panic::halt_on_timely_communication_panic::{closure#0}+872> " ", written = 187650356536624, mode = 1903991704}
buf = <optimized out>
#6 ___strfmon_l (s=<optimized out>, maxsize=<optimized out>, loc=<optimized out>, format=<optimized out>) at ./stdlib/strfmon_l.c:589
ap = {__stack = 0xaaaabab8e228 <anon.e27d5ef5bf9f321d6fa271d4e165cee1.1.llvm>, __gr_top = 0xffff717cb080, __vr_top = 0xffffa57fa8c0, __gr_offs = 1901854720, __vr_offs = 65535}
res = <optimized out>
#7 0x0000ffffb717cf0c in ___strfmon_l (s=<optimized out>, maxsize=<optimized out>, loc=<optimized out>, format=<optimized out>) at ./stdlib/strfmon_l.c:594
ap = {__stack = 0xffff717c9fe0, __gr_top = 0xaaaac0175bb8 <std::panicking::rust_panic_with_hook+568>, __vr_top = 0xffff717cb080, __gr_offs = -1162288600, __vr_offs = 43690}
res = <optimized out>
#8 0x0000000000000051 in ?? ()
As usual this happened while jemalloc was freeing memory. The unusual part is that this led to the test actually failing, so it has become more serious than before.
In the meantime I have made some progress on ASan (https://github.com/MaterializeInc/materialize/pull/24888); here is a local sqllogictest run which unfortunately seems to have found dozens of problems: https://gist.github.com/def-/45e08b7fd3cd7cc5d88dce6ecc823682 (via `bin/sqllogictest --release --asan --no-default-features -- -v -v -v -v test/sqllogictest/*.slt` on that PR).
I believe this double-free is causing the issue, seen in https://buildkite.com/materialize/tests/builds/74947#018d7e9c-ba9f-4ffe-b675-e0609b3717f6 (via https://github.com/MaterializeInc/materialize/pull/24888):
=================================================================
==7==ERROR: AddressSanitizer: attempting double-free on 0x51900000b480 in thread T30:
2024-02-06T15:16:14.494316Z WARN mz_timely_util::panic: halting process: timely communication error: flushing writer: Broken pipe (os error 32)
#0 0xaaaab1726500 in cfree.part.8 (/usr/local/bin/clusterd+0x3cb6500) (BuildId: 0748fb28db213eff)
#1 0xffff9c60cec8 (/lib/aarch64-linux-gnu/libc.so.6+0x3cec8) (BuildId: 3ba44e06b9dc66aeeb2651db4dd015ffaf6e0849)
#2 0xffff9c60cf08 in exit (/lib/aarch64-linux-gnu/libc.so.6+0x3cf08) (BuildId: 3ba44e06b9dc66aeeb2651db4dd015ffaf6e0849)
#3 0xaaaac1ad219c in std::sys::unix::os::exit::hd30414d8170c7e3b (.llvm.7749465669004736203) (/usr/local/bin/clusterd+0x1406219c) (BuildId: 0748fb28db213eff)
#4 0xaaaac1aca7c8 in std::process::exit::h851436d6d5fce807 (/usr/local/bin/clusterd+0x1405a7c8) (BuildId: 0748fb28db213eff)
#5 0xaaaac0e63bc0 in mz_timely_util::panic::halt_on_timely_communication_panic::{closure#0} (/usr/local/bin/clusterd+0x133f3bc0) (BuildId: 0748fb28db213eff)
#6 0xaaaac1ace80c in std::panicking::rust_panic_with_hook::h49c448744bd13368 (/usr/local/bin/clusterd+0x1405e80c) (BuildId: 0748fb28db213eff)
#7 0xaaaac1ace5b0 in std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::hc59ff399480db150 (/usr/local/bin/clusterd+0x1405e5b0) (BuildId: 0748fb28db213eff)
#8 0xaaaac1acc7d8 in std::sys_common::backtrace::__rust_end_short_backtrace::h9b12bcdf19180a12 (.llvm.7749465669004736203) (/usr/local/bin/clusterd+0x1405c7d8) (BuildId: 0748fb28db213eff)
#9 0xaaaac1ace33c in rust_begin_unwind (/usr/local/bin/clusterd+0x1405e33c) (BuildId: 0748fb28db213eff)
#10 0xaaaabda060d4 in core::panicking::panic_fmt::h240d97c04b4a38df (/usr/local/bin/clusterd+0xff960d4) (BuildId: 0748fb28db213eff)
#11 0xaaaac1c01470 in timely_communication::allocator::zero_copy::tcp::tcp_panic (/usr/local/bin/clusterd+0x14191470) (BuildId: 0748fb28db213eff)
#12 0xaaaaba57bb68 in timely_communication::allocator::zero_copy::tcp::recv_loop::<std::net::tcp::TcpStream> (/usr/local/bin/clusterd+0xcb0bb68) (BuildId: 0748fb28db213eff)
#13 0xaaaabbf19814 in std::sys_common::backtrace::__rust_begin_short_backtrace::<timely_communication::allocator::zero_copy::initialize::initialize_networking_from_sockets<std::net::tcp::TcpStream>::{closure#2}, ()> (/usr/local/bin/clusterd+0xe4a9814) (BuildId: 0748fb28db213eff)
#14 0xaaaabd4ce6e4 in std::panicking::try::do_call::<core::panic::unwind_safe::AssertUnwindSafe<<std::thread::Builder>::spawn_unchecked_<timely_communication::allocator::zero_copy::initialize::initialize_networking_from_sockets<std::net::tcp::TcpStream>::{closure#2}, ()>::{closure#1}::{closure#0}>, ()> (.llvm.4543018447158957069) (/usr/local/bin/clusterd+0xfa5e6e4) (BuildId: 0748fb28db213eff)
#15 0xaaaabd8e26e8 in __rust_try.llvm.4543018447158957069 (/usr/local/bin/clusterd+0xfe726e8) (BuildId: 0748fb28db213eff)
#16 0xaaaabd441794 in std::panicking::try::<(), core::panic::unwind_safe::AssertUnwindSafe<<std::thread::Builder>::spawn_unchecked_<timely_communication::allocator::zero_copy::initialize::initialize_networking_from_sockets<std::net::tcp::TcpStream>::{closure#2}, ()>::{closure#1}::{closure#0}>> (/usr/local/bin/clusterd+0xf9d1794) (BuildId: 0748fb28db213eff)
#17 0xaaaabc2bdf1c in <<std::thread::Builder>::spawn_unchecked_<timely_communication::allocator::zero_copy::initialize::initialize_networking_from_sockets<std::net::tcp::TcpStream>::{closure#2}, ()>::{closure#1} as core::ops::function::FnOnce<()>>::call_once::{shim:vtable#0} (/usr/local/bin/clusterd+0xe84df1c) (BuildId: 0748fb28db213eff)
#18 0xaaaac1ad40f4 in std::sys::unix::thread::Thread::new::thread_start::hc0f701f0c9161618 (/usr/local/bin/clusterd+0x140640f4) (BuildId: 0748fb28db213eff)
#19 0xaaaab16787d0 in asan_thread_start(void*) (/usr/local/bin/clusterd+0x3c087d0) (BuildId: 0748fb28db213eff)
#20 0xffff9c64d5c4 (/lib/aarch64-linux-gnu/libc.so.6+0x7d5c4) (BuildId: 3ba44e06b9dc66aeeb2651db4dd015ffaf6e0849)
#21 0xffff9c6b5ed8 (/lib/aarch64-linux-gnu/libc.so.6+0xe5ed8) (BuildId: 3ba44e06b9dc66aeeb2651db4dd015ffaf6e0849)
0x51900000b480 is located 0 bytes inside of 1040-byte region [0x51900000b480,0x51900000b890)
freed by thread T27 here:
#0 0xaaaab1726500 in cfree.part.8 (/usr/local/bin/clusterd+0x3cb6500) (BuildId: 0748fb28db213eff)
#1 0xffff9c60cec8 (/lib/aarch64-linux-gnu/libc.so.6+0x3cec8) (BuildId: 3ba44e06b9dc66aeeb2651db4dd015ffaf6e0849)
#2 0xffff9c60cf08 in exit (/lib/aarch64-linux-gnu/libc.so.6+0x3cf08) (BuildId: 3ba44e06b9dc66aeeb2651db4dd015ffaf6e0849)
#3 0xaaaac1ad219c in std::sys::unix::os::exit::hd30414d8170c7e3b (.llvm.7749465669004736203) (/usr/local/bin/clusterd+0x1406219c) (BuildId: 0748fb28db213eff)
#4 0xaaaac1ace80c in std::panicking::rust_panic_with_hook::h49c448744bd13368 (/usr/local/bin/clusterd+0x1405e80c) (BuildId: 0748fb28db213eff)
#5 0xaaaaba57bb68 in timely_communication::allocator::zero_copy::tcp::recv_loop::<std::net::tcp::TcpStream> (/usr/local/bin/clusterd+0xcb0bb68) (BuildId: 0748fb28db213eff)
#6 0xaaaabbf19814 in std::sys_common::backtrace::__rust_begin_short_backtrace::<timely_communication::allocator::zero_copy::initialize::initialize_networking_from_sockets<std::net::tcp::TcpStream>::{closure#2}, ()> (/usr/local/bin/clusterd+0xe4a9814) (BuildId: 0748fb28db213eff)
#7 0xaaaabd4ce6e4 in std::panicking::try::do_call::<core::panic::unwind_safe::AssertUnwindSafe<<std::thread::Builder>::spawn_unchecked_<timely_communication::allocator::zero_copy::initialize::initialize_networking_from_sockets<std::net::tcp::TcpStream>::{closure#2}, ()>::{closure#1}::{closure#0}>, ()> (.llvm.4543018447158957069) (/usr/local/bin/clusterd+0xfa5e6e4) (BuildId: 0748fb28db213eff)
#8 0xaaaabd8e26e8 in __rust_try.llvm.4543018447158957069 (/usr/local/bin/clusterd+0xfe726e8) (BuildId: 0748fb28db213eff)
#9 0xaaaabc2bdf1c in <<std::thread::Builder>::spawn_unchecked_<timely_communication::allocator::zero_copy::initialize::initialize_networking_from_sockets<std::net::tcp::TcpStream>::{closure#2}, ()>::{closure#1} as core::ops::function::FnOnce<()>>::call_once::{shim:vtable#0} (/usr/local/bin/clusterd+0xe84df1c) (BuildId: 0748fb28db213eff)
#10 0xaaaac1ad40f4 in std::sys::unix::thread::Thread::new::thread_start::hc0f701f0c9161618 (/usr/local/bin/clusterd+0x140640f4) (BuildId: 0748fb28db213eff)
#11 0xffff9c6b5ed8 (/lib/aarch64-linux-gnu/libc.so.6+0xe5ed8) (BuildId: 3ba44e06b9dc66aeeb2651db4dd015ffaf6e0849)
previously allocated by thread T0 here:
LLVMSymbolizer: error reading file: No such file or directory
#0 0xaaaab1726730 in calloc.part.10 (/usr/local/bin/clusterd+0x3cb6730) (BuildId: 0748fb28db213eff)
#1 0xffff9c60d0bc (/lib/aarch64-linux-gnu/libc.so.6+0x3d0bc) (BuildId: 3ba44e06b9dc66aeeb2651db4dd015ffaf6e0849)
#2 0xffff9c60d158 (/lib/aarch64-linux-gnu/libc.so.6+0x3d158) (BuildId: 3ba44e06b9dc66aeeb2651db4dd015ffaf6e0849)
#3 0xaaaab170ba08 in __cxa_atexit (/usr/local/bin/clusterd+0x3c9ba08) (BuildId: 0748fb28db213eff)
#4 0xaaaac29525bc in __libc_csu_init (/usr/local/bin/clusterd+0x14ee25bc) (BuildId: 0748fb28db213eff)
#5 0xffff9c5f74a8 in __libc_start_main (/lib/aarch64-linux-gnu/libc.so.6+0x274a8) (BuildId: 3ba44e06b9dc66aeeb2651db4dd015ffaf6e0849)
#6 0xaaaab1664034 in _start (/usr/local/bin/clusterd+0x3bf4034) (BuildId: 0748fb28db213eff)
#7 0xffffc3455de9 ([stack]+0x59de9)
Thread T30 created by T7 here:
#0 0xaaaab1708108 in pthread_create (/usr/local/bin/clusterd+0x3c98108) (BuildId: 0748fb28db213eff)
#1 0xaaaac1ad3f34 in std::sys::unix::thread::Thread::new::h1ac6cc45154cc15f (/usr/local/bin/clusterd+0x14063f34) (BuildId: 0748fb28db213eff)
#2 0xaaaaba576de4 in timely_communication::allocator::zero_copy::initialize::initialize_networking_from_sockets::<std::net::tcp::TcpStream> (/usr/local/bin/clusterd+0xcb06de4) (BuildId: 0748fb28db213eff)
#3 0xaaaabbec1924 in mz_cluster::communication::initialize_networking_inner::<std::net::tcp::TcpStream> (/usr/local/bin/clusterd+0xe451924) (BuildId: 0748fb28db213eff)
#4 0xaaaabcb0a244 in mz_cluster::communication::initialize_networking::{closure#0} (/usr/local/bin/clusterd+0xf09a244) (BuildId: 0748fb28db213eff)
#5 0xaaaabcb1c878 in <mz_cluster::server::ClusterClient<mz_service::client::Partitioned<mz_service::local::LocalClient<mz_compute_client::protocol::command::ComputeCommand, mz_compute_client::protocol::response::ComputeResponse, timely::scheduling::activate::SyncActivator>, mz_compute_client::protocol::command::ComputeCommand, mz_compute_client::protocol::response::ComputeResponse>, mz_compute::server::Config, mz_compute_client::protocol::command::ComputeCommand, mz_compute_client::protocol::response::ComputeResponse> as mz_service::client::GenericClient<mz_compute_client::protocol::command::ComputeCommand, mz_compute_client::protocol::response::ComputeResponse>>::send::{closure#0} (/usr/local/bin/clusterd+0xf0ac878) (BuildId: 0748fb28db213eff)
#6 0xaaaabcb2521c in <alloc::boxed::Box<dyn mz_compute_client::service::ComputeClient> as mz_service::client::GenericClient<mz_compute_client::protocol::command::ComputeCommand, mz_compute_client::protocol::response::ComputeResponse>>::send::{closure#0} (/usr/local/bin/clusterd+0xf0b521c) (BuildId: 0748fb28db213eff)
#7 0xaaaabc473974 in <async_stream::async_stream::AsyncStream<core::result::Result<mz_compute_client::protocol::response::ProtoComputeResponse, tonic::status::Status>, <mz_service::grpc::GrpcServer<mz_compute::server::serve::{closure#0}>>::forward_bidi_stream<mz_compute_client::protocol::command::ComputeCommand, mz_compute_client::protocol::response::ComputeResponse, mz_compute_client::protocol::command::ProtoComputeCommand, mz_compute_client::protocol::response::ProtoComputeResponse>::{closure#0}::{closure#1}> as futures_core::stream::Stream>::poll_next (/usr/local/bin/clusterd+0xea03974) (BuildId: 0748fb28db213eff)
#8 0xaaaaba9b3ec8 in <futures_util::stream::stream::map::Map<futures_util::stream::try_stream::into_stream::IntoStream<core::pin::Pin<alloc::boxed::Box<dyn futures_core::stream::Stream<Item = core::result::Result<mz_compute_client::protocol::response::ProtoComputeResponse, tonic::status::Status>> + core::marker::Send>>>, tonic::codec::encode::encode<tonic::codec::prost::ProstEncoder<mz_compute_client::protocol::response::ProtoComputeResponse>, futures_util::stream::try_stream::into_stream::IntoStream<core::pin::Pin<alloc::boxed::Box<dyn futures_core::stream::Stream<Item = core::result::Result<mz_compute_client::protocol::response::ProtoComputeResponse, tonic::status::Status>> + core::marker::Send>>>>::{closure#0}> as futures_core::stream::TryStream>::try_poll_next (/usr/local/bin/clusterd+0xcf43ec8) (BuildId: 0748fb28db213eff)
#9 0xaaaabc4a8f18 in <tonic::codec::encode::EncodeBody<futures_util::stream::try_stream::into_stream::IntoStream<futures_util::stream::stream::map::Map<futures_util::stream::try_stream::into_stream::IntoStream<core::pin::Pin<alloc::boxed::Box<dyn futures_core::stream::Stream<Item = core::result::Result<mz_compute_client::protocol::response::ProtoComputeResponse, tonic::status::Status>> + core::marker::Send>>>, tonic::codec::encode::encode<tonic::codec::prost::ProstEncoder<mz_compute_client::protocol::response::ProtoComputeResponse>, futures_util::stream::try_stream::into_stream::IntoStream<core::pin::Pin<alloc::boxed::Box<dyn futures_core::stream::Stream<Item = core::result::Result<mz_compute_client::protocol::response::ProtoComputeResponse, tonic::status::Status>> + core::marker::Send>>>>::{closure#0}>>> as http_body::Body>::poll_data (/usr/local/bin/clusterd+0xea38f18) (BuildId: 0748fb28db213eff)
#10 0xaaaabbca6a34 in <http_body::combinators::map_err::MapErr<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>, <tonic::status::Status>::map_error<tonic::status::Status>> as http_body::Body>::poll_data (/usr/local/bin/clusterd+0xe236a34) (BuildId: 0748fb28db213eff)
#11 0xaaaabbca6318 in <http_body::combinators::map_err::MapErr<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>, <axum_core::error::Error>::new<tonic::status::Status>> as http_body::Body>::poll_data (/usr/local/bin/clusterd+0xe236318) (BuildId: 0748fb28db213eff)
#12 0xaaaac1fe0e98 in <http_body::combinators::map_err::MapErr<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, axum_core::error::Error>, <tonic::status::Status>::map_error<axum_core::error::Error>> as http_body::Body>::poll_data (/usr/local/bin/clusterd+0x14570e98) (BuildId: 0748fb28db213eff)
#13 0xaaaabbca9700 in <http_body::combinators::map_err::MapErr<tonic::transport::server::recover_error::MaybeEmptyBody<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>>, <tonic::status::Status as core::convert::Into<alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>::into> as http_body::Body>::poll_data (/usr/local/bin/clusterd+0xe239700) (BuildId: 0748fb28db213eff)
#14 0xaaaab82a6e48 in <hyper::proto::h2::PipeToSendStream<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>> as core::future::future::Future>::poll (/usr/local/bin/clusterd+0xa836e48) (BuildId: 0748fb28db213eff)
#15 0xaaaab7cea41c in <hyper::proto::h2::server::H2Stream<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = core::result::Result<http::response::Response<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>> + core::marker::Send>>, http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>> as core::future::future::Future>::poll (/usr/local/bin/clusterd+0xa27a41c) (BuildId: 0748fb28db213eff)
#16 0xaaaabc39bdb4 in <tracing::instrument::Instrumented<hyper::proto::h2::server::H2Stream<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = core::result::Result<http::response::Response<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>> + core::marker::Send>>, http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>> as core::future::future::Future>::poll (/usr/local/bin/clusterd+0xe92bdb4) (BuildId: 0748fb28db213eff)
#17 0xaaaab7bbe21c in <tokio::runtime::task::core::Core<tracing::instrument::Instrumented<hyper::proto::h2::server::H2Stream<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = core::result::Result<http::response::Response<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>> + core::marker::Send>>, http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>>::poll (/usr/local/bin/clusterd+0xa14e21c) (BuildId: 0748fb28db213eff)
#18 0xaaaabd4b0bf8 in std::panicking::try::do_call::<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future<tracing::instrument::Instrumented<hyper::proto::h2::server::H2Stream<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = core::result::Result<http::response::Response<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>> + core::marker::Send>>, http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>::{closure#0}>, core::task::poll::Poll<()>> (/usr/local/bin/clusterd+0xfa40bf8) (BuildId: 0748fb28db213eff)
#19 0xaaaabd8e26e8 in __rust_try.llvm.4543018447158957069 (/usr/local/bin/clusterd+0xfe726e8) (BuildId: 0748fb28db213eff)
#20 0xaaaab795673c in tokio::runtime::task::raw::poll::<tracing::instrument::Instrumented<hyper::proto::h2::server::H2Stream<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = core::result::Result<http::response::Response<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>> + core::marker::Send>>, http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>> (/usr/local/bin/clusterd+0x9ee673c) (BuildId: 0748fb28db213eff)
#21 0xaaaac1c4cb78 in <tokio::runtime::scheduler::multi_thread::worker::Context>::run_task (/usr/local/bin/clusterd+0x141dcb78) (BuildId: 0748fb28db213eff)
#22 0xaaaac1c409d4 in <tokio::runtime::context::scoped::Scoped<tokio::runtime::scheduler::Context>>::set::<tokio::runtime::scheduler::multi_thread::worker::run::{closure#0}::{closure#0}, ()> (/usr/local/bin/clusterd+0x141d09d4) (BuildId: 0748fb28db213eff)
#23 0xaaaac1d32d4c in tokio::runtime::context::set_scheduler::<(), tokio::runtime::scheduler::multi_thread::worker::run::{closure#0}::{closure#0}> (/usr/local/bin/clusterd+0x142c2d4c) (BuildId: 0748fb28db213eff)
#24 0xaaaac1c5c81c in tokio::runtime::context::runtime::enter_runtime::<tokio::runtime::scheduler::multi_thread::worker::run::{closure#0}, ()> (/usr/local/bin/clusterd+0x141ec81c) (BuildId: 0748fb28db213eff)
#25 0xaaaac1c4bfd8 in tokio::runtime::scheduler::multi_thread::worker::run (/usr/local/bin/clusterd+0x141dbfd8) (BuildId: 0748fb28db213eff)
#26 0xaaaac1d0bb00 in <tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>> as core::future::future::Future>::poll (/usr/local/bin/clusterd+0x1429bb00) (BuildId: 0748fb28db213eff)
#27 0xaaaac1cc6290 in <tokio::runtime::task::core::Core<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>>::poll (/usr/local/bin/clusterd+0x14256290) (BuildId: 0748fb28db213eff)
#28 0xaaaac1d1eeb8 in std::panicking::try::do_call::<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>::{closure#0}>, core::task::poll::Poll<()>> (/usr/local/bin/clusterd+0x142aeeb8) (BuildId: 0748fb28db213eff)
#29 0xaaaac1d284d8 in __rust_try.llvm.15611257329881442963 (/usr/local/bin/clusterd+0x142b84d8) (BuildId: 0748fb28db213eff)
#30 0xaaaac1c054b4 in <tokio::runtime::task::harness::Harness<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>>::poll (/usr/local/bin/clusterd+0x141954b4) (BuildId: 0748fb28db213eff)
#31 0xaaaac1cac0b8 in <tokio::runtime::task::UnownedTask<tokio::runtime::blocking::schedule::BlockingSchedule>>::run (/usr/local/bin/clusterd+0x1423c0b8) (BuildId: 0748fb28db213eff)
#32 0xaaaac1c2f990 in std::sys_common::backtrace::__rust_begin_short_backtrace::<<tokio::runtime::blocking::pool::Spawner>::spawn_thread::{closure#0}, ()> (/usr/local/bin/clusterd+0x141bf990) (BuildId: 0748fb28db213eff)
#33 0xaaaac1d2099c in std::panicking::try::do_call::<core::panic::unwind_safe::AssertUnwindSafe<<std::thread::Builder>::spawn_unchecked_<<tokio::runtime::blocking::pool::Spawner>::spawn_thread::{closure#0}, ()>::{closure#1}::{closure#0}>, ()> (.llvm.15611257329881442963) (/usr/local/bin/clusterd+0x142b099c) (BuildId: 0748fb28db213eff)
#34 0xaaaac1d284d8 in __rust_try.llvm.15611257329881442963 (/usr/local/bin/clusterd+0x142b84d8) (BuildId: 0748fb28db213eff)
#35 0xaaaac1c2ff5c in <<std::thread::Builder>::spawn_unchecked_<<tokio::runtime::blocking::pool::Spawner>::spawn_thread::{closure#0}, ()>::{closure#1} as core::ops::function::FnOnce<()>>::call_once::{shim:vtable#0} (/usr/local/bin/clusterd+0x141bff5c) (BuildId: 0748fb28db213eff)
#36 0xaaaac1ad40f4 in std::sys::unix::thread::Thread::new::thread_start::hc0f701f0c9161618 (/usr/local/bin/clusterd+0x140640f4) (BuildId: 0748fb28db213eff)
#37 0xffff9c6b5ed8 (/lib/aarch64-linux-gnu/libc.so.6+0xe5ed8) (BuildId: 3ba44e06b9dc66aeeb2651db4dd015ffaf6e0849)
Thread T7 created by T0 here:
#0 0xaaaab1708108 in pthread_create (/usr/local/bin/clusterd+0x3c98108) (BuildId: 0748fb28db213eff)
#1 0xaaaac1ad3f34 in std::sys::unix::thread::Thread::new::h1ac6cc45154cc15f (/usr/local/bin/clusterd+0x14063f34) (BuildId: 0748fb28db213eff)
#2 0xaaaac1c23238 in <tokio::runtime::blocking::pool::Spawner>::spawn_task (/usr/local/bin/clusterd+0x141b3238) (BuildId: 0748fb28db213eff)
#3 0xaaaac1c18920 in <tokio::runtime::blocking::pool::Spawner>::spawn_blocking::<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}, ()> (/usr/local/bin/clusterd+0x141a8920) (BuildId: 0748fb28db213eff)
#4 0xaaaac1c12cc4 in tokio::runtime::blocking::pool::spawn_blocking::<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}, ()> (/usr/local/bin/clusterd+0x141a2cc4) (BuildId: 0748fb28db213eff)
#5 0xaaaac1c4bb48 in <tokio::runtime::scheduler::multi_thread::worker::Launch>::launch (/usr/local/bin/clusterd+0x141dbb48) (BuildId: 0748fb28db213eff)
#6 0xaaaac1c6d024 in <tokio::runtime::builder::Builder>::build (/usr/local/bin/clusterd+0x141fd024) (BuildId: 0748fb28db213eff)
#7 0xaaaabc5bc618 in clusterd::main (/usr/local/bin/clusterd+0xeb4c618) (BuildId: 0748fb28db213eff)
#8 0xaaaabbf15e44 in std::sys_common::backtrace::__rust_begin_short_backtrace::<fn(), ()> (/usr/local/bin/clusterd+0xe4a5e44) (BuildId: 0748fb28db213eff)
#9 0xaaaabc5bd580 in main (/usr/local/bin/clusterd+0xeb4d580) (BuildId: 0748fb28db213eff)
#10 0xffff9c5f74c8 in __libc_start_main (/lib/aarch64-linux-gnu/libc.so.6+0x274c8) (BuildId: 3ba44e06b9dc66aeeb2651db4dd015ffaf6e0849)
#11 0xaaaab1664034 in _start (/usr/local/bin/clusterd+0x3bf4034) (BuildId: 0748fb28db213eff)
#12 0xffffc3455de9 ([stack]+0x59de9)
Thread T27 created by T1 here:
#0 0xaaaab1708108 in pthread_create (/usr/local/bin/clusterd+0x3c98108) (BuildId: 0748fb28db213eff)
#1 0xaaaac1ad3f34 in std::sys::unix::thread::Thread::new::h1ac6cc45154cc15f (/usr/local/bin/clusterd+0x14063f34) (BuildId: 0748fb28db213eff)
#2 0xaaaaba576de4 in timely_communication::allocator::zero_copy::initialize::initialize_networking_from_sockets::<std::net::tcp::TcpStream> (/usr/local/bin/clusterd+0xcb06de4) (BuildId: 0748fb28db213eff)
#3 0xaaaabbec1924 in mz_cluster::communication::initialize_networking_inner::<std::net::tcp::TcpStream> (/usr/local/bin/clusterd+0xe451924) (BuildId: 0748fb28db213eff)
#4 0xaaaabcb0a244 in mz_cluster::communication::initialize_networking::{closure#0} (/usr/local/bin/clusterd+0xf09a244) (BuildId: 0748fb28db213eff)
#5 0xaaaabcb1034c in <mz_cluster::server::ClusterClient<mz_service::client::Partitioned<mz_service::local::LocalClient<mz_storage_client::client::StorageCommand, mz_storage_client::client::StorageResponse, std::thread::Thread>, mz_storage_client::client::StorageCommand, mz_storage_client::client::StorageResponse>, mz_storage::server::Config, mz_storage_client::client::StorageCommand, mz_storage_client::client::StorageResponse> as mz_service::client::GenericClient<mz_storage_client::client::StorageCommand, mz_storage_client::client::StorageResponse>>::send::{closure#0} (/usr/local/bin/clusterd+0xf0a034c) (BuildId: 0748fb28db213eff)
#6 0xaaaabcb24450 in <alloc::boxed::Box<dyn mz_storage_client::client::StorageClient> as mz_service::client::GenericClient<mz_storage_client::client::StorageCommand, mz_storage_client::client::StorageResponse>>::send::{closure#0} (/usr/local/bin/clusterd+0xf0b4450) (BuildId: 0748fb28db213eff)
#7 0xaaaabc46c6e0 in <async_stream::async_stream::AsyncStream<core::result::Result<mz_storage_client::client::ProtoStorageResponse, tonic::status::Status>, <mz_service::grpc::GrpcServer<mz_storage::server::serve::{closure#0}>>::forward_bidi_stream<mz_storage_client::client::StorageCommand, mz_storage_client::client::StorageResponse, mz_storage_client::client::ProtoStorageCommand, mz_storage_client::client::ProtoStorageResponse>::{closure#0}::{closure#1}> as futures_core::stream::Stream>::poll_next (/usr/local/bin/clusterd+0xe9fc6e0) (BuildId: 0748fb28db213eff)
#8 0xaaaaba9b39c8 in <futures_util::stream::stream::map::Map<futures_util::stream::try_stream::into_stream::IntoStream<core::pin::Pin<alloc::boxed::Box<dyn futures_core::stream::Stream<Item = core::result::Result<mz_storage_client::client::ProtoStorageResponse, tonic::status::Status>> + core::marker::Send>>>, tonic::codec::encode::encode<tonic::codec::prost::ProstEncoder<mz_storage_client::client::ProtoStorageResponse>, futures_util::stream::try_stream::into_stream::IntoStream<core::pin::Pin<alloc::boxed::Box<dyn futures_core::stream::Stream<Item = core::result::Result<mz_storage_client::client::ProtoStorageResponse, tonic::status::Status>> + core::marker::Send>>>>::{closure#0}> as futures_core::stream::TryStream>::try_poll_next (/usr/local/bin/clusterd+0xcf439c8) (BuildId: 0748fb28db213eff)
#9 0xaaaabc4a8878 in <tonic::codec::encode::EncodeBody<futures_util::stream::try_stream::into_stream::IntoStream<futures_util::stream::stream::map::Map<futures_util::stream::try_stream::into_stream::IntoStream<core::pin::Pin<alloc::boxed::Box<dyn futures_core::stream::Stream<Item = core::result::Result<mz_storage_client::client::ProtoStorageResponse, tonic::status::Status>> + core::marker::Send>>>, tonic::codec::encode::encode<tonic::codec::prost::ProstEncoder<mz_storage_client::client::ProtoStorageResponse>, futures_util::stream::try_stream::into_stream::IntoStream<core::pin::Pin<alloc::boxed::Box<dyn futures_core::stream::Stream<Item = core::result::Result<mz_storage_client::client::ProtoStorageResponse, tonic::status::Status>> + core::marker::Send>>>>::{closure#0}>>> as http_body::Body>::poll_data (/usr/local/bin/clusterd+0xea38878) (BuildId: 0748fb28db213eff)
#10 0xaaaabbca6a34 in <http_body::combinators::map_err::MapErr<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>, <tonic::status::Status>::map_error<tonic::status::Status>> as http_body::Body>::poll_data (/usr/local/bin/clusterd+0xe236a34) (BuildId: 0748fb28db213eff)
#11 0xaaaabbca6318 in <http_body::combinators::map_err::MapErr<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>, <axum_core::error::Error>::new<tonic::status::Status>> as http_body::Body>::poll_data (/usr/local/bin/clusterd+0xe236318) (BuildId: 0748fb28db213eff)
#12 0xaaaac1fe0e98 in <http_body::combinators::map_err::MapErr<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, axum_core::error::Error>, <tonic::status::Status>::map_error<axum_core::error::Error>> as http_body::Body>::poll_data (/usr/local/bin/clusterd+0x14570e98) (BuildId: 0748fb28db213eff)
#13 0xaaaabbca9700 in <http_body::combinators::map_err::MapErr<tonic::transport::server::recover_error::MaybeEmptyBody<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, tonic::status::Status>>, <tonic::status::Status as core::convert::Into<alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>::into> as http_body::Body>::poll_data (/usr/local/bin/clusterd+0xe239700) (BuildId: 0748fb28db213eff)
#14 0xaaaab82a6e48 in <hyper::proto::h2::PipeToSendStream<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>> as core::future::future::Future>::poll (/usr/local/bin/clusterd+0xa836e48) (BuildId: 0748fb28db213eff)
#15 0xaaaab7cea41c in <hyper::proto::h2::server::H2Stream<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = core::result::Result<http::response::Response<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>> + core::marker::Send>>, http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>> as core::future::future::Future>::poll (/usr/local/bin/clusterd+0xa27a41c) (BuildId: 0748fb28db213eff)
#16 0xaaaabc39bdb4 in <tracing::instrument::Instrumented<hyper::proto::h2::server::H2Stream<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = core::result::Result<http::response::Response<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>> + core::marker::Send>>, http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>> as core::future::future::Future>::poll (/usr/local/bin/clusterd+0xe92bdb4) (BuildId: 0748fb28db213eff)
#17 0xaaaab7bbe21c in <tokio::runtime::task::core::Core<tracing::instrument::Instrumented<hyper::proto::h2::server::H2Stream<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = core::result::Result<http::response::Response<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>> + core::marker::Send>>, http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>>::poll (/usr/local/bin/clusterd+0xa14e21c) (BuildId: 0748fb28db213eff)
#18 0xaaaabd4b0bf8 in std::panicking::try::do_call::<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future<tracing::instrument::Instrumented<hyper::proto::h2::server::H2Stream<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = core::result::Result<http::response::Response<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>> + core::marker::Send>>, http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>::{closure#0}>, core::task::poll::Poll<()>> (/usr/local/bin/clusterd+0xfa40bf8) (BuildId: 0748fb28db213eff)
#19 0xaaaabd8e26e8 in __rust_try.llvm.4543018447158957069 (/usr/local/bin/clusterd+0xfe726e8) (BuildId: 0748fb28db213eff)
#20 0xaaaab795673c in tokio::runtime::task::raw::poll::<tracing::instrument::Instrumented<hyper::proto::h2::server::H2Stream<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = core::result::Result<http::response::Response<http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>> + core::marker::Send>>, http_body::combinators::box_body::UnsyncBoxBody<bytes::bytes::Bytes, alloc::boxed::Box<dyn core::error::Error + core::marker::Sync + core::marker::Send>>>>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>> (/usr/local/bin/clusterd+0x9ee673c) (BuildId: 0748fb28db213eff)
#21 0xaaaac1c4cb78 in <tokio::runtime::scheduler::multi_thread::worker::Context>::run_task (/usr/local/bin/clusterd+0x141dcb78) (BuildId: 0748fb28db213eff)
#22 0xaaaac1c409d4 in <tokio::runtime::context::scoped::Scoped<tokio::runtime::scheduler::Context>>::set::<tokio::runtime::scheduler::multi_thread::worker::run::{closure#0}::{closure#0}, ()> (/usr/local/bin/clusterd+0x141d09d4) (BuildId: 0748fb28db213eff)
#23 0xaaaac1d32d4c in tokio::runtime::context::set_scheduler::<(), tokio::runtime::scheduler::multi_thread::worker::run::{closure#0}::{closure#0}> (/usr/local/bin/clusterd+0x142c2d4c) (BuildId: 0748fb28db213eff)
#24 0xaaaac1c5c81c in tokio::runtime::context::runtime::enter_runtime::<tokio::runtime::scheduler::multi_thread::worker::run::{closure#0}, ()> (/usr/local/bin/clusterd+0x141ec81c) (BuildId: 0748fb28db213eff)
#25 0xaaaac1c4bfd8 in tokio::runtime::scheduler::multi_thread::worker::run (/usr/local/bin/clusterd+0x141dbfd8) (BuildId: 0748fb28db213eff)
#26 0xaaaac1d0bb00 in <tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>> as core::future::future::Future>::poll (/usr/local/bin/clusterd+0x1429bb00) (BuildId: 0748fb28db213eff)
#27 0xaaaac1cc6290 in <tokio::runtime::task::core::Core<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>>::poll (/usr/local/bin/clusterd+0x14256290) (BuildId: 0748fb28db213eff)
#28 0xaaaac1d1eeb8 in std::panicking::try::do_call::<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>::{closure#0}>, core::task::poll::Poll<()>> (/usr/local/bin/clusterd+0x142aeeb8) (BuildId: 0748fb28db213eff)
#29 0xaaaac1d284d8 in __rust_try.llvm.15611257329881442963 (/usr/local/bin/clusterd+0x142b84d8) (BuildId: 0748fb28db213eff)
#30 0xaaaac1c054b4 in <tokio::runtime::task::harness::Harness<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>>::poll (/usr/local/bin/clusterd+0x141954b4) (BuildId: 0748fb28db213eff)
#31 0xaaaac1cac0b8 in <tokio::runtime::task::UnownedTask<tokio::runtime::blocking::schedule::BlockingSchedule>>::run (/usr/local/bin/clusterd+0x1423c0b8) (BuildId: 0748fb28db213eff)
#32 0xaaaac1c2f990 in std::sys_common::backtrace::__rust_begin_short_backtrace::<<tokio::runtime::blocking::pool::Spawner>::spawn_thread::{closure#0}, ()> (/usr/local/bin/clusterd+0x141bf990) (BuildId: 0748fb28db213eff)
#33 0xaaaac1d2099c in std::panicking::try::do_call::<core::panic::unwind_safe::AssertUnwindSafe<<std::thread::Builder>::spawn_unchecked_<<tokio::runtime::blocking::pool::Spawner>::spawn_thread::{closure#0}, ()>::{closure#1}::{closure#0}>, ()> (.llvm.15611257329881442963) (/usr/local/bin/clusterd+0x142b099c) (BuildId: 0748fb28db213eff)
#34 0xaaaac1d284d8 in __rust_try.llvm.15611257329881442963 (/usr/local/bin/clusterd+0x142b84d8) (BuildId: 0748fb28db213eff)
#35 0xaaaac1c2ff5c in <<std::thread::Builder>::spawn_unchecked_<<tokio::runtime::blocking::pool::Spawner>::spawn_thread::{closure#0}, ()>::{closure#1} as core::ops::function::FnOnce<()>>::call_once::{shim:vtable#0} (/usr/local/bin/clusterd+0x141bff5c) (BuildId: 0748fb28db213eff)
#36 0xaaaac1ad40f4 in std::sys::unix::thread::Thread::new::thread_start::hc0f701f0c9161618 (/usr/local/bin/clusterd+0x140640f4) (BuildId: 0748fb28db213eff)
#37 0xffff9c6b5ed8 (/lib/aarch64-linux-gnu/libc.so.6+0xe5ed8) (BuildId: 3ba44e06b9dc66aeeb2651db4dd015ffaf6e0849)
Thread T1 created by T0 here:
#0 0xaaaab1708108 in pthread_create (/usr/local/bin/clusterd+0x3c98108) (BuildId: 0748fb28db213eff)
#1 0xaaaac1ad3f34 in std::sys::unix::thread::Thread::new::h1ac6cc45154cc15f (/usr/local/bin/clusterd+0x14063f34) (BuildId: 0748fb28db213eff)
#2 0xaaaac1c23238 in <tokio::runtime::blocking::pool::Spawner>::spawn_task (/usr/local/bin/clusterd+0x141b3238) (BuildId: 0748fb28db213eff)
#3 0xaaaac1c18920 in <tokio::runtime::blocking::pool::Spawner>::spawn_blocking::<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}, ()> (/usr/local/bin/clusterd+0x141a8920) (BuildId: 0748fb28db213eff)
#4 0xaaaac1c12cc4 in tokio::runtime::blocking::pool::spawn_blocking::<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}, ()> (/usr/local/bin/clusterd+0x141a2cc4) (BuildId: 0748fb28db213eff)
#5 0xaaaac1c4bb48 in <tokio::runtime::scheduler::multi_thread::worker::Launch>::launch (/usr/local/bin/clusterd+0x141dbb48) (BuildId: 0748fb28db213eff)
#6 0xaaaac1c6d024 in <tokio::runtime::builder::Builder>::build (/usr/local/bin/clusterd+0x141fd024) (BuildId: 0748fb28db213eff)
#7 0xaaaabc5bc618 in clusterd::main (/usr/local/bin/clusterd+0xeb4c618) (BuildId: 0748fb28db213eff)
#8 0xaaaabbf15e44 in std::sys_common::backtrace::__rust_begin_short_backtrace::<fn(), ()> (/usr/local/bin/clusterd+0xe4a5e44) (BuildId: 0748fb28db213eff)
#9 0xaaaabc5bd580 in main (/usr/local/bin/clusterd+0xeb4d580) (BuildId: 0748fb28db213eff)
#10 0xffff9c5f74c8 in __libc_start_main (/lib/aarch64-linux-gnu/libc.so.6+0x274c8) (BuildId: 3ba44e06b9dc66aeeb2651db4dd015ffaf6e0849)
#11 0xaaaab1664034 in _start (/usr/local/bin/clusterd+0x3bf4034) (BuildId: 0748fb28db213eff)
#12 0xffffc3455de9 ([stack]+0x59de9)
SUMMARY: AddressSanitizer: double-free (/usr/local/bin/clusterd+0x3cb6500) (BuildId: 0748fb28db213eff) in cfree.part.8
==7==ABORTING
Double-frees are worth tracking down, but I'm not sure about the last one:
#4 0xaaaac1aca7c8 in std::process::exit::h851436d6d5fce807 (/usr/local/bin/clusterd+0x1405a7c8) (BuildId: 0748fb28db213eff)
#5 0xaaaac0e63bc0 in mz_timely_util::panic::halt_on_timely_communication_panic::{closure#0} (/usr/local/bin/clusterd+0x133f3bc0) (BuildId: 0748fb28db213eff)
#6 0xaaaac1ace80c in std::panicking::rust_panic_with_hook::h49c448744bd13368 (/usr/local/bin/clusterd+0x1405e80c) (BuildId: 0748fb28db213eff)
This indicates that we're about to halt the process, so I'm not particularly worried about this one -- though I could be wrong, in that this double free might show up elsewhere too.
This fits well with what we've been seeing in CI: the segfaults mostly happened without any consequence, so it makes sense that they occur during a halt.
I still think it's worth fixing this, primarily so that we can find other double-frees. Currently this and one other ASan issue are spamming the logs so much that it's hard to tell when a new problem pops up.
Still happening relatively often in CI, for example: https://buildkite.com/materialize/release-qualification/builds/507#018fec15-e04b-453a-a901-ef7dd01fafc5 In the release qualification this is especially painful, since we have to restart the run, which takes many hours, and we don't get to see the panic containing the double free, which might have been interesting. Full gdb backtrace: https://gist.github.com/def-/3236b2b899baa61e7e0287824f278d7d
This is so far the smallest Zippy scenario I could reproduce this with:
class ClusterReplicas(Scenario):
    """A Zippy test that uses CREATE / DROP REPLICA and random killing."""

    def bootstrap(self) -> list[ActionOrFactory]:
        return super().bootstrap() + [
            DropDefaultReplica,
            CreateReplica,
        ]

    def actions_with_weight(self) -> dict[ActionOrFactory, float]:
        return {
            CreateReplica: 10,
            DropReplica: 10,
        }
With this repro the segfault always occurs in a process of a multi-process replica, after that process has invoked the halt_on_timely_communication_panic panic hook. So it seems to be the same double free that Dennis found above. I'm still no closer to figuring out what causes it, though. Some exception-unsafe unsafe code somewhere?
Is there any chance that this is caused by the problem that is being fixed in https://github.com/MaterializeInc/materialize/pull/27497 ?
Unlikely, because that code shouldn't be enabled in CI.
I'm pretty certain by now that jemalloc profiling produces at least one of the segfaults. In the backtraces above, and in every instance I've seen with the ClusterReplicas zippy scenario, the segfault always happens in `_rjem_je_prof_free_sampled_object` in src/prof.c:158. That's this line, fwiw: https://github.com/jemalloc/jemalloc/blob/e13ca993e8ccb9ba9847cc330696e02839f328f7/src/prof.c#L158
It's unlikely that the segfault would always occur at the same place if this was just jemalloc stumbling over a memory corruption introduced elsewhere. I'd expect random memory accesses to segfault in that case.
Switching off jemalloc profiling makes the ClusterReplicas scenario pass, at least. I have a branch here, in case anyone wants to do more testing: https://github.com/MaterializeInc/materialize/pull/27603
I suspect the double free Dennis points out above is a different issue, given that it occurred in a different test and, I assume, had jemalloc disabled.
Opened an issue for the jemalloc segfault: https://github.com/jemalloc/jemalloc/issues/2662
> It's unlikely that the segfault would always occur at the same place if this was just jemalloc stumbling over a memory corruption introduced elsewhere. I'd expect random memory accesses to segfault in that case.

... that said, the segfault does occur in `prof_free_sampled_object`, which gets invoked on jemalloc's deallocation path. So it could be caused by a double free after all.
I managed to create a repro that triggers the same segfault with a double free:

```toml
[package]
name = "jemalloc-test"
version = "0.1.0"
edition = "2021"

[dependencies]
tikv-jemallocator = { version = "*", features = ["profiling"] }
```

```rust
use std::alloc::{self, Layout};

use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

// Enable jemalloc heap profiling and sample every allocation.
#[export_name = "_rjem_malloc_conf"]
pub static MALLOC_CONF: &[u8] = b"prof:true,lg_prof_sample:0\0";

fn main() {
    unsafe {
        let layout = Layout::from_size_align(1000, 1).unwrap();
        let ptr = alloc::alloc(layout);
        println!("ptr={ptr:?}");
        alloc::dealloc(ptr, layout);
        alloc::dealloc(ptr, layout); // double free
    }
}
```
Without profiling enabled, this program does not segfault. With profiling enabled it segfaults with this backtrace:
```
#0 0x0000aaaaaacc435c in _rjem_je_prof_free_sampled_object (tsd=0xfffff7ff0770,
usize=1024, prof_info=0xffffffffe528) at src/prof.c:158
#1 0x0000aaaaaabc2594 in prof_free (alloc_ctx=0xffffffffe4e0, usize=1024,
ptr=0xfffff7820000, tsd=0xfffff7ff0770)
at include/jemalloc/internal/prof_inlines.h:257
#2 isfree (slow_path=false, tcache=0xfffff7ff0ac8, usize=1024, ptr=0xfffff7820000,
tsd=0xfffff7ff0770) at src/jemalloc.c:2979
#3 _rjem_je_sdallocx_default (ptr=0xfffff7820000, size=1000, flags=0)
at src/jemalloc.c:3924
#4 0x0000aaaaaabd0260 in _rjem_sdallocx (ptr=0xfffff7820000, size=1000, flags=0)
at src/jemalloc.c:3939
#5 0x0000aaaaaaaaf650 in tikv_jemallocator::{impl#0}::dealloc (
self=0xaaaaaadeb7d2, ptr=0xfffff7820000, layout=...)
at /home/jan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tikv-jemallocator-0.5.4/src/lib.rs:129
#6 0x0000aaaaaaaaed44 in jemalloc_test::_::__rust_dealloc (ptr=0xfffff7820000,
size=1000, align=1) at src/main.rs:6
#7 0x0000aaaaaaaaf064 in alloc::alloc::dealloc (ptr=0xfffff7820000, layout=...)
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/alloc/src/alloc.rs:119
#8 0x0000aaaaaaaaecc0 in jemalloc_test::main () at src/main.rs:17
```
Which is consistent with everything observed above.
What's strange is that running the repro with asan enabled doesn't encounter any memory issues at all: https://buildkite.com/materialize/release-qualification/builds/519. I would have expected the double free to show up here.
In other news, it looks like the segfault is absent in a version of Mz that has rocksdb ripped out (and jemalloc profiling enabled): https://buildkite.com/materialize/release-qualification/builds/522. That points to rocksdb as the thing that causes the double free. This would be consistent with Dennis' ASan output above, which shows that the double free happens on memory allocated in `__cxa_atexit`. Rocksdb makes heavy use of that, likely to register global/static destructors.
I think I know what's happening now. Take another look at this backtrace I collected in one of the various times I managed to reproduce the segfault:
```
#0 0x0000aaaab33b7170 in _rjem_je_prof_free_sampled_object (tsd=tsd@entry=0xffff4ddeb110, ptr=ptr@entry=0xffff90636000, usize=usize@entry=4096, prof_info=prof_info@entry=0xffff4dde98f8)
at src/prof.c:194
#1 0x0000aaaab336e7c8 in prof_free (alloc_ctx=<synthetic pointer>, usize=4096, ptr=0xffff90636000, tsd=0xffff4ddeb110) at include/jemalloc/internal/prof_inlines.h:284
#2 ifree (slow_path=false, tcache=0xffff4ddeb470, ptr=0xffff90636000, tsd=0xffff4ddeb110) at src/jemalloc.c:2920
#3 _rjem_je_free_default (ptr=0xffff90636000) at src/jemalloc.c:3028
#4 0x0000aaaab336efb4 in je_free_impl (ptr=0xffff90636000) at include/jemalloc/internal/jemalloc_internal_inlines_c.h:593
#5 0x0000ffff90b5cecc in __run_exit_handlers (status=2, listp=0xffff90cbb730 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
at ./stdlib/exit.c:128
#6 0x0000ffff90b5cf0c in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
#7 0x0000aaaab3152898 in std::sys::pal::unix::os::exit () at library/std/src/sys/pal/unix/os.rs:750
#8 0x0000aaaab314bc84 in std::process::exit () at library/std/src/process.rs:2290
```
The second free of the double free happens in glibc's `__run_exit_handlers`. I initially assumed that it was caused by one of the exit handlers, but if we follow the line number, we end up here: https://github.com/bminor/glibc/blob/f94f6d8a3572840d3ba42ab9ace3ea522c99c0c2/stdlib/exit.c#L128

So the problematic `free` is performed by `__run_exit_handlers` to free a block of the `exit_function_list` linked list. It's unlikely that any code outside of glibc would attempt to also free this memory, so we are looking at a bug in glibc itself.
`__run_exit_handlers` protects the manipulation of `exit_function_list` with a lock, `__exit_funcs_lock`, to be thread safe. However, it unlocks the lock every time it calls one of the registered exit handlers, giving other threads an opportunity to enter the function's critical section and start walking the `exit_function_list`. It is ensured that concurrent threads don't call the same exit handler multiple times, but nothing ensures that multiple threads don't free the same block of `exit_function_list`.
Specifically, this can happen:

1. Thread 1 calls `std::process::exit()`, which calls `__run_exit_handlers` in glibc.
2. Thread 1 locks `__exit_funcs_lock` (source) and gets a reference to the first block of `exit_function_list` (source).
3. Thread 1 unlocks `__exit_funcs_lock` (source), and calls the exit handler.
4. Thread 2 calls `std::process::exit()`, which calls `__run_exit_handlers` in glibc.
5. Thread 2 locks `__exit_funcs_lock` (source) and gets a reference to the first block of `exit_function_list` (source).
6. Both threads keep calling the exit handlers in `exit_function_list`, locking and unlocking the `__exit_funcs_lock` as they go. Eventually they both reach the end of the block; let's say thread 1 is first.
7. Thread 1 calls `free` on the first block of `exit_function_list` (source).
8. Thread 2 calls `free` on the first block of `exit_function_list` (source), hence causing a double free.

This explanation is consistent with the backtrace above, and also with Dennis' ASan report. It's not consistent with other backtraces posted here, but they all seem corrupted, so I guess we can't trust those. It also explains why removing rocksdb makes the segfault go away: Rocksdb is the only thing in `clusterd` that installs exit handlers, so without it there are no exit handlers (or maybe one installed by libc itself), and the above sequence of events is only possible if there is more than one block in the `exit_function_list`, which requires more than 32 installed exit handlers.
Somewhat relatedly, I also found this issue, which opens a rabbit hole about the thread safety of `std::process::exit`: https://github.com/rust-lang/rust/issues/83994. The TLDR is that `std::process::exit` calls libc's `exit` function, which is thread safe (modulo bugs like the one described above), but there is nothing to enforce that the exit handlers are also thread safe. You have to rely on correctly implemented C/C++ libraries or bring your own locking, which `std::process::exit` doesn't currently do.
This makes me think it is generally a bad idea to allow `std::process::exit` to be executed from multiple threads, or even just from one thread while other threads are still running concurrently. There is no indication that rocksdb installs exit handlers that are not thread safe, but we still open ourselves up to issues in the future, especially if we start depending on other C++ libraries.
A simple fix for the `halt!` macro, which is what triggered the segfault reported here, is to use glibc's `_exit` function instead of `exit`. `_exit` doesn't call the exit handlers and just terminates the process, like `abort` does. I'm trying this out here: https://github.com/MaterializeInc/materialize/pull/27705
> This makes me think it is generally a bad idea to allow `std::process::exit` to be executed from multiple threads

Yes, it looks like that is specifically UB. From a comment in https://github.com/rust-lang/rust/issues/83994:

> One further thing to note: while exit is required to be thread-safe, the behavior is undefined if it's called more than once (from the same thread or concurrently from multiple ones). If Rust is not already handling this in some way to ensure safety, it probably should.
So I think our two options are:

1. Call `_exit` directly, like you're doing in #27705.
2. Add a wrapper around exit (e.g. `ore::process::exit`) which acquires a global exit lock of our creation before calling `std::process::exit`.

Do we lose anything by not calling the exit handlers? If not, then (1) sounds like the way to go.
Found a minimal repro for the above use-after-free/double free in glibc:
```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

void exit_handler() {
    sleep(1);
    printf("at exit\n");
}

void *run_thread(void *arg) {
    exit(2);
}

int main() {
    for (int i = 0; i < 1000; i++) {
        atexit(exit_handler);
    }
    for (int i = 0; i < 10; i++) {
        pthread_t thread_id;
        pthread_create(&thread_id, NULL, run_thread, NULL);
    }
    sleep(10);
}
```
This segfaults almost immediately with the following backtrace on my machine.
> Do we lose anything by not calling the exit handlers? If not, then (1) sounds like the way to go.

I don't think so! As I understand it, `halt!` is intended as "an expected abort", i.e. we would normally use abort (on panic) but we can't do that because we don't want to trigger all the crash alerting mechanisms. Abort doesn't run the exit handlers either, so I see no reason why we'd need to run them on halt. Plus, any code that relies on exit handlers to run is broken anyway, since it's not robust against hardware crashes.
What version of Materialize are you using?
39d3f407f96761a65fcc836caf47300e416f0f68
What is the issue?
https://buildkite.com/materialize/nightlies/builds/4013#_ Recently we started seeing segfaults in timely during the data-ingest test:
We don't have coredumps in CI yet (https://github.com/MaterializeInc/materialize/pull/19961), so I'll try to reproduce this locally.