MaterializeInc / materialize


CREATE TABLE is stuck in cloudtest's test_replica_restart.py::test_crash_clusterd #28235

Open def- opened 1 month ago

def- commented 1 month ago

What version of Materialize are you using?

1a1f4364bb6e

What is the issue?

This is flaking in main: https://buildkite.com/materialize/test/builds/85860#0190b3b8-8d2b-4139-9ab7-3e616ed71f7d

2024-07-15 00:27:05 UTC test/cloudtest/test_replica_restart.py::test_crash_clusterd
2024-07-15 00:27:05 UTC -------------------------------- live log call ---------------------------------
2024-07-15 00:27:05 UTC [    INFO] > DROP TABLE IF EXISTS t1 CASCADE (k8s_service.py:88)
2024-07-15 00:27:05 UTC [    INFO] > CREATE TABLE t1 (f1 TEXT) (k8s_service.py:88)
2024-07-15 00:50:07 UTC # Received cancellation signal, interrupting

@maddyblue Can you take a look? I previously also brought this up on Slack: https://materializeinc.slack.com/archives/C01LKF361MZ/p1720431691737379

I'm trying to reproduce it locally with cd test/cloudtest && ./teardown && ./setup && ./pytest --splits=8 --group=6 -m "not long and not node_recovery", but no luck so far. I would like to get the logs, which are missing when cloudtest times out in CI.
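A rough sketch of how a hung statement could be made to fail fast and leave logs behind, instead of waiting for the CI-level cancellation. This is not existing cloudtest code: run_with_deadline, the on_timeout hook, and fake_hung_statement are hypothetical names used only for illustration, and what the hook actually collects (for example pod logs) is left open.

```python
import threading
import time


def run_with_deadline(fn, timeout_s, on_timeout):
    """Run fn() in a daemon thread; if it is still running after timeout_s
    seconds, call on_timeout() (e.g. to dump logs) and raise TimeoutError.

    Hypothetical helper, not part of cloudtest. The worker thread is not
    killed; it is a daemon so it cannot block interpreter exit.
    """
    result = {}

    def worker():
        try:
            result["value"] = fn()
        except BaseException as exc:  # forward any failure to the caller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)

    if thread.is_alive():
        # The statement is still running: collect diagnostics, then fail.
        on_timeout()
        raise TimeoutError(f"statement did not finish within {timeout_s}s")
    if "error" in result:
        raise result["error"]
    return result["value"]


def fake_hung_statement():
    """Stand-in for a SQL call that never returns (hypothetical)."""
    time.sleep(2)
```

In a test, the CREATE TABLE call would be passed as fn, with a timeout well below the CI job limit, so the hook still has time to run before the job is cancelled.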

def- commented 1 month ago

Now a simple SELECT is stuck in the Postgres CDC test: https://buildkite.com/materialize/test/builds/86174#0190c31e-f4c7-4241-9ddf-cf9d70b332b8

2024-07-18 00:38:34 UTC > SELECT * FROM unique_nullable
2024-07-18 01:02:06 UTC rows didn't match; sleeping to see if dataflow catches up 50ms
# Received cancellation signal, interrupting

As usual, no logs, can't reproduce.