def- opened this issue 2 weeks ago
The log about finalization should be unrelated to the hung query. I also think it's a thing that already occurred before; it's just that I added logging for it in https://github.com/MaterializeInc/materialize/pull/27057.
The failure might be related to another recent change of mine: https://github.com/MaterializeInc/materialize/pull/27496. Though that could also be a red herring.
Do we know what the query returns? From that log snippet it looks like we do our regular retries but never see what we want, as opposed to the query just not returning, right?
I think the query returned a result once, but it was the wrong result. Then the second try hangs.
Why do we expect that much logging? It's spamming services.log in CI.
The only other relevant logging I see is:
testdrive-materialized-1 | cluster-u1-replica-u1-gen-0: 2024-06-24T11:30:38.769686Z INFO mz_persist_client::internal::machine: snapshot sceed0578-efae-456b-8a30-d4e0dfbbd885 as of [1719228636768] not yet available for v15 upper [1719228636001]
testdrive-materialized-1 | cluster-u34-replica-u36-gen-0: 2024-06-24T11:38:03.047972Z WARN mz_persist_client::internal::machine: heartbeat task for reader (r2e3de075-faa6-438a-a9f2-df60a78e6721) of shard (s69269886-7262-4537-a4ed-2b55eb7a4b66) exiting due to expired lease while read handle is live
testdrive-materialized-1 | cluster-u34-replica-u36-gen-0: 2024-06-24T11:38:03.048097Z WARN mz_persist_client::internal::machine: heartbeat task for reader (r2e3de075-faa6-438a-a9f2-df60a78e6721) of shard (s69269886-7262-4537-a4ed-2b55eb7a4b66) exiting due to expired lease while read handle is live
testdrive-materialized-1 | cluster-u34-replica-u36-gen-0: 2024-06-24T11:38:03.073256Z WARN mz_persist_client::internal::machine: heartbeat task for reader (r2548ce70-3e72-4b97-a48d-bac9e81aa61e) of shard (s017ca4a9-0bb4-4ac4-b981-2b62459066e1) exiting due to expired lease while read handle is live
testdrive-materialized-1 | cluster-u34-replica-u36-gen-0: 2024-06-24T11:38:03.073380Z WARN mz_persist_client::internal::machine: heartbeat task for reader (r2548ce70-3e72-4b97-a48d-bac9e81aa61e) of shard (s017ca4a9-0bb4-4ac4-b981-2b62459066e1) exiting due to expired lease while read handle is live
Another occurrence in a different testdrive test: https://buildkite.com/materialize/test/builds/84505#01904b67-1af9-454b-8118-b14f953ea5aa Edit: again: https://buildkite.com/materialize/test/builds/84526#01904cb6-156c-4e63-9255-65a9a4ee00c1
Again the only relevant logging other than background finalization errors:
testdrive-materialized-1 | cluster-u1-replica-u1-gen-0: 2024-06-25T00:11:13.370574Z INFO mz_persist_client::internal::machine: snapshot sdc9c8941-19c0-483d-9188-d823dff454ac as of [1719274271519] not yet available for v37 upper [1719274271001]
This seems to be a recent regression.
I'm marking this as a release blocker since it's occurring often in CI and seems to be a recent regression. I'll try to get a reproducer locally and can then hopefully figure out more:
while true; do BUILDKITE_PARALLEL_JOB="6" BUILDKITE_PARALLEL_JOB_COUNT="8" bin/mzcompose --find testdrive run default || break; done
Edit: I couldn't reproduce this yet. One idea for what could be happening is that Mz is just overloaded, but that shouldn't be the case because we clear most resources between individual td files.
Got it reproduced locally with the same command; Mz is now stuck:
> CREATE SOURCE up_with_update
IN CLUSTER lg_cluster
FROM LOAD GENERATOR KEY VALUE (
KEYS 16,
PARTITIONS 4,
SNAPSHOT ROUNDS 3,
SEED 123,
VALUE SIZE 10,
BATCH SIZE 2,
TICK INTERVAL '1s'
)
ENVELOPE UPSERT
rows match; continuing at ts 1719303066.6064289
> SELECT partition, count(*) FROM up_with_update GROUP BY partition
rows match; continuing at ts 1719303067.212674
> SELECT
encode(value, 'base64') != 'jVfAtykQHFlCMA=='
FROM up_with_update
WHERE
key = 14
rows didn't match; sleeping to see if dataflow catches up 50ms 75ms 113ms 169ms 253ms 380ms 570ms 854ms 1s
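Side note: the sleep intervals testdrive prints in that retry log (50ms 75ms 113ms ... 854ms) are consistent with a plain x1.5 exponential backoff. A minimal sketch of that cadence, purely for illustration (the function name and parameters are hypothetical, not testdrive's actual implementation):

```python
import math
from itertools import islice

def backoff_intervals(initial_ms=50.0, factor=1.5):
    """Yield sleep intervals (in ms) growing by `factor` each retry,
    rounded half-up like the values printed in the testdrive log."""
    interval = initial_ms
    while True:
        yield math.floor(interval + 0.5)  # round half up: 112.5 -> 113
        interval *= factor

print([f"{ms}ms" for ms in islice(backoff_intervals(), 8)])
# -> ['50ms', '75ms', '113ms', '169ms', '253ms', '380ms', '570ms', '854ms']
```

The eight values reproduce the logged sequence exactly, which is why the log reads like ordinary backoff rather than anything anomalous on the testdrive side.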
The clusterd processes are stuck using full CPU:
materialize=> select 1;
?column?
----------
1
(1 row)
materialize=> explain SELECT
encode(value, 'base64') != 'jVfAtykQHFlCMA=='
FROM up_with_update
WHERE
key = 14;
Optimized Plan
----------------------------------------------------------
Explained Query: +
Project (#3) +
Filter (14 = bigint_to_numeric(#0)) +
Map (("jVfAtykQHFlCMA==" != encode(#2, "base64")))+
ReadStorage materialize.public.up_with_update +
+
Source materialize.public.up_with_update +
filter=((14 = bigint_to_numeric(#0))) +
+
Target cluster: mz_system +
(1 row)
materialize=> select * from mz_internal.mz_source_statuses;
id | name | type | last_status_change_at | status | error | details
-------+-------------------------+----------------+----------------------------+---------+-------+---------
u4159 | up_with_update_progress | progress | | running | |
u4160 | up_with_update | load-generator | 2024-06-25 08:11:06.957+00 | running | |
(2 rows)
Since this always fails with load generators, maybe this shouldn't be a release blocker?
Full gdb backtraces: https://gist.github.com/def-/a3ebfe3d46a31c945bf27c5598a66a15
Does it repro when just running these commands? Or you need to run testdrive in a loop for a bit?
Needed the loop for an hour or more. I couldn't find an easier way to repro it. (Tried while true; do bin/mzcompose --find testdrive run default load-generator-key-value.td || break; done unsuccessfully for a while.)
I'm un-assigning for now because this looks like an issue with the load generator source, right?
It seems so, reassigning to Storage team.
FWIW, I don't think this should be a release blocker if it only applies to the key-value load gen source. That source is behind a feature flag that is not currently enabled in production: https://app.launchdarkly.com/projects/default/flags/enable_load_generator_key_value/targeting?env=production&env=staging&env=development&env=loadtest&env=ci-cd&selected-env=production
I re-added the release blocker, as we don't have evidence yet that this is key-value load gen specific. I inspected the stacktraces captured by @def-, and you can see that the only threads that are not parked or in a syscall are four timely threads that all have this stack:
Thread 74 (LWP 49592 "timely:work-3"):
#0 0x0000601daf49e8ce in mz_storage::upsert::upsert_inner::{closure#3}::{async_block#0}::{closure#3}<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_repr::timestamp::Timestamp>, (mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime)>, mz_storage_types::sources::MzOffset, mz_storage::upsert::upsert::{closure_env#0}<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_repr::timestamp::Timestamp>, (mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime)>, mz_storage_types::sources::MzOffset>, mz_storage::upsert::upsert::{closure#0}::{async_block_env#0}<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_repr::timestamp::Timestamp>, (mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime)>, mz_storage_types::sources::MzOffset>, mz_storage::upsert::rocksdb::RocksDB<core::option::Option<mz_storage_types::sources::MzOffset>>> (cx=0x73e9f7ff09c0) at /cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/macros/select.rs:510
#1 tokio::future::poll_fn::{impl#1}::poll<mz_storage::upsert::upsert_inner::{closure#3}::{async_block#0}::__tokio_select_util::Out<core::option::Option<mz_timely_util::builder_async::Event<(mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime), timely::dataflow::operators::capability::Capability<(mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime)>, alloc::vec::Vec<((mz_storage::upsert::UpsertKey, core::result::Result<mz_repr::row::Row, mz_storage_types::errors::UpsertError>), (mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime), i64), alloc::alloc::Global>>>, core::option::Option<mz_timely_util::builder_async::Event<(mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime), timely::dataflow::operators::capability::Capability<(mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime)>, alloc::vec::Vec<((mz_storage::upsert::UpsertKey, core::option::Option<core::result::Result<mz_repr::row::Row, mz_storage_types::errors::UpsertError>>, mz_storage_types::sources::MzOffset), (mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime), i64), alloc::alloc::Global>>>>, mz_storage::upsert::upsert_inner::{closure#3}::{async_block#0}::{closure_env#3}<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_repr::timestamp::Timestamp>, (mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime)>, mz_storage_types::sources::MzOffset, mz_storage::upsert::upsert::{closure_env#0}<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_repr::timestamp::Timestamp>, (mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime)>, 
mz_storage_types::sources::MzOffset>, mz_storage::upsert::upsert::{closure#0}::{async_block_env#0}<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_repr::timestamp::Timestamp>, (mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime)>, mz_storage_types::sources::MzOffset>, mz_storage::upsert::rocksdb::RocksDB<core::option::Option<mz_storage_types::sources::MzOffset>>>> (cx=0x73e9f7ff09c0, self=...) at /cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/future/poll_fn.rs:58
#2 mz_storage::upsert::upsert_inner::{closure#3}::{async_block#0}<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_repr::timestamp::Timestamp>, (mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime)>, mz_storage_types::sources::MzOffset, mz_storage::upsert::upsert::{closure_env#0}<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_repr::timestamp::Timestamp>, (mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime)>, mz_storage_types::sources::MzOffset>, mz_storage::upsert::upsert::{closure#0}::{async_block_env#0}<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_repr::timestamp::Timestamp>, (mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime)>, mz_storage_types::sources::MzOffset>, mz_storage::upsert::rocksdb::RocksDB<core::option::Option<mz_storage_types::sources::MzOffset>>> () at src/storage/src/upsert.rs:707
#3 0x0000601dae68be02 in core::future::future::{impl#1}::poll<alloc::boxed::Box<mz_storage::upsert::upsert_inner::{closure#3}::{async_block_env#0}<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_repr::timestamp::Timestamp>, (mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime)>, mz_storage_types::sources::MzOffset, mz_storage::upsert::upsert::{closure_env#0}<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_repr::timestamp::Timestamp>, (mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime)>, mz_storage_types::sources::MzOffset>, mz_storage::upsert::upsert::{closure#0}::{async_block_env#0}<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_repr::timestamp::Timestamp>, (mz_repr::timestamp::Timestamp, mz_storage_operators::persist_source::Subtime)>, mz_storage_types::sources::MzOffset>, mz_storage::upsert::rocksdb::RocksDB<core::option::Option<mz_storage_types::sources::MzOffset>>>, alloc::alloc::Global>> (self=..., cx=0x73e9f7ff09c0) at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/future/future.rs:123
The stack trace points to this line https://github.com/MaterializeInc/materialize/blob/448764fc5afe15e07ab4aeda41691a5ec3a0f855/src/storage/src/upsert.rs#L707 which suggests that there are inputs with which the upsert operator can get into this situation.
Another curious thing visible in Dennis' screenshot is that there are four timely workers, which in theory are single-threaded, consuming the entirety of 16 cores. This must mean that whatever we're busy working on is work scheduled on a threadpool. rocksdb is enabled in that test, so it could be that the rocksdb threadpool is stuck somewhere. I don't see the rocksdb threads being busy in the backtraces, though.
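For what it's worth, the "parked or in a syscall" triage above can be scripted against a gdb "thread apply all bt" dump. A rough sketch; the sample text and the idle-frame keyword list are illustrative assumptions, not an exhaustive classification:

```python
import re

# Abbreviated sample of a gdb "thread apply all bt" dump (illustrative).
SAMPLE = """\
Thread 74 (LWP 49592 "timely:work-3"):
#0  0x0000601daf49e8ce in mz_storage::upsert::upsert_inner::{closure#3} ()
Thread 75 (LWP 49593 "tokio:work-0"):
#0  0x00007f1a2b3c4d5e in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
Thread 76 (LWP 49594 "timely:work-0"):
#0  0x00007f1a2b3c4e6f in syscall () from /lib/x86_64-linux-gnu/libc.so.6
"""

# Innermost frames that indicate a thread is blocked/idle rather than working.
IDLE = ("epoll_wait", "futex", "syscall", "park", "pthread_cond")

def busy_threads(bt_text):
    """Return names of threads whose innermost frame (#0) is not an
    idle/blocking call -- a rough filter for 'actually doing work'."""
    busy = []
    thread = None
    for line in bt_text.splitlines():
        m = re.match(r'Thread \d+ \(LWP \d+ "([^"]+)"\)', line)
        if m:
            thread = m.group(1)
        elif line.startswith("#0") and thread:
            if not any(marker in line for marker in IDLE):
                busy.append(thread)
            thread = None
    return busy

print(busy_threads(SAMPLE))  # -> ['timely:work-3']
```

On the real gist this kind of filter is what leaves only the four timely workers stuck in the upsert operator's select loop.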
Hmm, this reminds me: I did enable the storage_rocksdb_use_merge_operator flag in production last week, but it's still disabled by default in the code, so I think these CI runs would still be running with the flag turned off, unless there is some LD -> CI configuration I'm not aware of.
I see rocksdb_use_native_merge_operator=false in the logs, which I assume means it's not enabled, right?
correct, not enabled, so likely unrelated then
The reproducer I had before didn't work a second time. I got into the stuck state again like this:
ALTER SYSTEM SET storage_statistics_collection_interval = 1000;
ALTER SYSTEM SET storage_statistics_interval = 2000;
ALTER SYSTEM SET enable_load_generator_key_value = true;
DROP CLUSTER lg_cluster2 CASCADE;
CREATE CLUSTER lg_cluster2 SIZE '4-4';
CREATE SOURCE up_no_update
IN CLUSTER lg_cluster2
FROM LOAD GENERATOR KEY VALUE (
KEYS 16,
PARTITIONS 4,
SNAPSHOT ROUNDS 3,
SEED 123,
VALUE SIZE 10,
BATCH SIZE 2
)
ENVELOPE UPSERT;
CREATE SOURCE up_quick
IN CLUSTER lg_cluster2
FROM LOAD GENERATOR KEY VALUE (
KEYS 16,
PARTITIONS 4,
SNAPSHOT ROUNDS 3,
TRANSACTIONAL SNAPSHOT false,
SEED 123,
VALUE SIZE 10,
BATCH SIZE 2
)
INCLUDE KEY AS whatever
ENVELOPE UPSERT;
SELECT partition, count(*) FROM up_no_update GROUP BY partition;
SELECT MAX(key) FROM up_no_update;
SELECT partition, count(*) FROM up_quick GROUP BY partition;
SELECT MAX(whatever) FROM up_quick;
SELECT
s.name,
u.offset_known,
u.offset_committed,
u.snapshot_records_known,
u.snapshot_records_staged,
u.messages_received,
u.records_indexed
FROM mz_sources s
JOIN mz_internal.mz_source_statistics u ON s.id = u.id
WHERE s.name IN ('up_no_update', 'up_quick');
SELECT
encode(value, 'base64')
FROM up_no_update
WHERE
key = 14;
SELECT
encode(value, 'base64')
FROM up_quick
WHERE
whatever = 14;
ALTER CLUSTER lg_cluster2 SET (REPLICATION FACTOR 0);
ALTER CLUSTER lg_cluster2 SET (REPLICATION FACTOR 1);
SELECT
encode(value, 'base64') = '${pre-rehydration}'
FROM up_no_update
WHERE
key = 14
SELECT
encode(value, 'base64') = '${pre-rehydration-quick}'
FROM up_quick
WHERE
whatever = 14
SELECT
s.name,
u.offset_known,
u.offset_committed,
u.snapshot_records_known,
u.snapshot_records_staged,
u.messages_received,
u.records_indexed
FROM mz_sources s
JOIN mz_internal.mz_source_statistics u ON s.id = u.id
WHERE s.name IN ('up_no_update', 'up_quick');
DROP SOURCE up_no_update;
DROP SOURCE up_quick;
CREATE SOURCE up_with_update2
IN CLUSTER lg_cluster2
FROM LOAD GENERATOR KEY VALUE (
KEYS 128,
PARTITIONS 4,
SNAPSHOT ROUNDS 2,
SEED 123,
VALUE SIZE 10,
BATCH SIZE 4,
TICK INTERVAL '1s'
)
ENVELOPE UPSERT;
SELECT partition, count(*) FROM up_with_update2 GROUP BY partition;
SELECT
s.name,
u.offset_known > 3,
u.offset_committed = u.offset_known,
u.snapshot_records_known,
u.snapshot_records_staged,
u.messages_received > 48,
u.records_indexed
FROM mz_sources s
JOIN mz_internal.mz_source_statistics u ON s.id = u.id
WHERE s.name IN ('up_with_update2');
SELECT
encode(value, 'base64') != '${pre-rehydration}'
FROM up_with_update2
WHERE
key = 14;
SELECT
encode(value, 'base64')
FROM up_with_update2
WHERE
key = 14;
SELECT
encode(value, 'base64') != '${pre-rehydration-with-update}'
FROM up_with_update2
WHERE
key = 14;
And then running this in an endless loop:
SELECT encode(value, 'base64') != 'jVfAtykQHFlCMA==' FROM up_with_update2 WHERE key = 14;
What version of Materialize are you using?
448764fc5afe
What is the issue?
Seen in https://buildkite.com/materialize/test/builds/84424#019049e5-b374-427b-976f-023030ecc7cc
The test stays stuck for 30 min and never recovers until it times out. The query is stuck in Mz for 30 min too:
There were large changes to the relevant code in https://github.com/MaterializeInc/materialize/pull/27057, so I'm not sure if this PR is related or not.