MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com

thread 'timely:recv-0' panicked at 'Failed to receive MergeQueue: RecvError' #20513

Open def- opened 1 year ago

def- commented 1 year ago

What version of Materialize are you using?

b83440d41865e38a6352bbb7db30b4786f264028

What is the issue?

Seen in https://buildkite.com/materialize/nightlies/builds/2820#01894f2b-22d0-4414-acec-d121a0aab403

zippy-materialized-1     | thread 'timely:recv-0' panicked at 'Failed to receive MergeQueue: RecvError', /cargo/git/checkouts/timely-dataflow-70b80d81d6cabd62/15ac623/communication/src/allocator/zero_copy/tcp.rs:45:77

I think this is unrelated to my CRDB upgrade change, in which it occurred: https://github.com/MaterializeInc/materialize/pull/20507. I retriggered the run to make sure, but I'm expecting this to be a flake: https://buildkite.com/materialize/nightlies/builds/2824

ci-regexp: (Failed to receive MergeQueue: RecvError|panicked at 'failed to send MergeQueue: "SendError(..)"')

antiguru commented 1 year ago

Just before that line:

zippy-materialized-1     | cluster-u6-replica-8: 2023-07-13T13:10:30.712636Z  WARN mz_cluster::communication: failed to initialize network: Resource temporarily unavailable (os error 11) process=3

os error 11 is EAGAIN, which we're apparently not handling.

This then causes a panic within Timely, which takes down the whole process. The orchestrator restarted the process, so it seems we ended up in a good state.

Closing because the root cause seems to be an OS issue, and we're handling it OK.
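
For illustration only, here is a minimal sketch of what handling EAGAIN could look like. On Linux, os error 11 surfaces in Rust as `std::io::ErrorKind::WouldBlock`. The helper `retry_on_eagain`, its parameters, and the linear backoff are assumptions made for this sketch; they are not part of `mz_cluster` or Timely, and the thread does not say which call during network initialization returned the error.

```rust
use std::io;
use std::thread;
use std::time::Duration;

/// Hypothetical helper (not from Materialize or Timely): runs `op` and
/// retries it while it fails with `WouldBlock`, i.e. EAGAIN on Linux.
/// Any other error, or `WouldBlock` on the last attempt, is returned as-is.
fn retry_on_eagain<T, F>(mut op: F, max_attempts: u32, backoff: Duration) -> io::Result<T>
where
    F: FnMut() -> io::Result<T>,
{
    let mut attempt = 1;
    loop {
        match op() {
            Err(e) if e.kind() == io::ErrorKind::WouldBlock && attempt < max_attempts => {
                // Linear backoff: wait a little longer after each failed attempt.
                thread::sleep(backoff * attempt);
                attempt += 1;
            }
            // Success, a non-retryable error, or attempts exhausted.
            other => return other,
        }
    }
}
```

A caller would wrap the failing initialization step in a closure and pass it to `retry_on_eagain`. Whether retrying is appropriate depends on which resource was temporarily unavailable.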

philip-stoev commented 1 year ago

If the preferred way to handle EAGAIN is to exit the process, it should not happen with a panic that is then reflected in the CI and Sentry, but with an orderly non-panic exit. So I am re-opening the ticket until the panic can be silenced properly.
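
A rough sketch of the difference, using a plain `std::sync::mpsc` channel as a stand-in for Timely's internals: the panic message has the shape that `expect` produces on a `RecvError`, and the alternative is to turn the closed channel into a value that the caller maps to an exit code. The `Shutdown` type, the function names, and the exit code values are hypothetical and exist only for this example.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical reason for a clean shutdown; not a type from Timely or Materialize.
#[derive(Debug, PartialEq)]
enum Shutdown {
    PeerDisconnected,
}

/// Receives one message, reporting a closed channel as a value
/// instead of panicking via `expect`.
fn recv_or_shutdown<T>(rx: &Receiver<T>) -> Result<T, Shutdown> {
    rx.recv().map_err(|_| Shutdown::PeerDisconnected)
}

/// Drains the channel until every sender is gone, then reports why it stopped.
fn drain(rx: Receiver<u32>) -> Result<(), Shutdown> {
    loop {
        // `?` leaves the loop with `Err(Shutdown::PeerDisconnected)`
        // once the channel is closed; no panic is involved.
        let _msg = recv_or_shutdown(&rx)?;
    }
}

/// Maps the outcome to a process exit code (the values are arbitrary assumptions).
fn exit_code(outcome: &Result<(), Shutdown>) -> i32 {
    match outcome {
        Ok(()) => 0,
        Err(Shutdown::PeerDisconnected) => 2,
    }
}
```

The process entry point could then log the reason at a non-panic level and call `std::process::exit(exit_code(&outcome))`, which would keep the event out of panic-based CI and Sentry reporting.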

antiguru commented 1 year ago

Idea: If this uses unmanaged replicas, convert to using managed replicas.

nrainer-materialize commented 11 months ago

Adding ci-regexp: panicked at 'failed to send MergeQueue: "SendError(..)"' here since https://github.com/MaterializeInc/materialize/issues/22027 is marked as a duplicate of this issue. This occurred in https://buildkite.com/materialize/tests/builds/66628#018b5d1d-0e35-494c-ac20-fd28a1f81c79.

philip-stoev commented 9 months ago

This happened in the normal CI as well: https://buildkite.com/materialize/tests/builds/70500#018c3b75-1685-4bb8-b5a6-ec532814014c

nrainer-materialize commented 4 months ago

This was observed in the release-qualification build:

zippy-storaged-1      | thread 'thread 'timely:work-0timely:work-1' panicked at ' panicked at /cargo/git/checkouts/timely-dataflow-70b80d81d6cabd62/89bcb73/communication/src/allocator/process.rs/cargo/git/checkouts/timely-dataflow-70b80d81d6cabd62/89bcb73/communication/src/allocator/process.rs::4339::4033:
zippy-storaged-1      | Failed to recv buzzer: RecvError:
zippy-storaged-1      | 
zippy-storaged-1      | Failed to send buzzer: "SendError(..)"
zippy-storaged-1      | stack backtrace: