Open def- opened 1 month ago
What's the evidence that this is a balancerd problem? The network error?
Although the statement history reports that the statement succeeded, that does not mean the transaction succeeded. The listed query is an INSERT .. SELECT
which is a ReadThenWrite plan which, on statement success, means that the rows to insert are stashed in the transaction. The committing of the transaction is what persists the new data to durable storage. The failure could occur during commit time. For example I see one error in that slack channel:
scalability-benchmark-connection: Uploading results failed! Failed to write to test analytics database! Cause: {'S': 'ERROR', 'C': '25001', 'M': ".. cannot be run inside a transaction block"}
The reason here is different than the one in this issue "network error", so these could be separate problems.
The "cannot be run inside a transaction block" is unrelated, I broke that and fixed it already: https://github.com/MaterializeInc/materialize/pull/28525 It was not happening at the time of the original "network error" failures.
The failure could occur during commit time.
Is there a way to see that? I guess we need a transaction history. ;) I tried searching for the query ID in grafana, but couldn't find anything being logged. The logs around that time also don't seem weird:
All of these queries are being run with autocommit=True
btw, in case it matters.
I don't think there's a way to see this besides looking in the client logs. autocommit=True is indeed what I would expect to cause this (the broken thing was implicit transaction commit, which is what autocommit does). Mostly I don't actually think it's that problem (which is fixed by https://github.com/MaterializeInc/materialize/pull/28549), it's just on my mind so I'm wondering if more things are fixed by that PR than we expect because it was silently doing bad things for a long time (in rare cases).
So if this was a transaction problem we could get more details by removing the autocommit and checking the result of each transaction? I can do that
Again, I'm doubtful that was a correct guess, it was just a recent thing I looked at. I think the most pertinent bug now is
cannot be run inside a transaction block
which means the INSERT is being executed incorrectly (with BEGIN/COMMIT). I recommend fixing that then looking at the new incoming errors.
This has already been fixed in https://github.com/MaterializeInc/materialize/pull/28525, it's just occurring in older open PRs which don't have my change yet.
What Materialize org is used to execute these queries?
Materialize Production Analytics.
Materialize Production Analytics.
What database and schema is build_job
in?
Materialize Production Analytics.
What database and schema is
build_job
in?
raw.test_analytics
FWIW the inner SELECT
query executes fairly fast on the production analytics env, even without the index.
materialize=> EXPLAIN SELECT 1
FROM raw.test_analytics.build_job
WHERE build_job_id = '0190e435-3357-4940-b8e2-560821b1acca';
Optimized Plan
-----------------------------------------------------------------------------------------
Explained Query: +
Project (#13) // { arity: 1 } +
Filter (#0{build_job_id} = "0190e435-3357-4940-b8e2-560821b1acca") // { arity: 14 }+
Map (1) // { arity: 14 } +
ReadStorage raw.test_analytics.build_job // { arity: 13 } +
+
Source raw.test_analytics.build_job +
filter=((#0{build_job_id} = "0190e435-3357-4940-b8e2-560821b1acca")) +
pushdown=((#0{build_job_id} = "0190e435-3357-4940-b8e2-560821b1acca")) +
+
Target cluster: materialize +
(1 row)
Time: 20.065 ms
materialize=> SELECT 1
FROM raw.test_analytics.build_job
WHERE build_job_id = '0190e435-3357-4940-b8e2-560821b1acca';
?column?
----------
1
(1 row)
A raw SELECT
is fast too
materialize=> EXPLAIN SELECT 1
FROM raw.test_analytics.build_job;
Optimized Plan
-----------------------------------------------------------------
Explained Query: +
Project (#13) // { arity: 1 } +
Map (1) // { arity: 14 } +
ReadStorage raw.test_analytics.build_job // { arity: 13 }+
+
Source raw.test_analytics.build_job +
+
Target cluster: materialize +
(1 row)
Time: 17.553 ms
materialize=> SELECT 1
FROM raw.test_analytics.build_job;
Time: 1491.239 ms (00:01.491)
It might get slower when someone creates a PR that fails in every test.
What version of Materialize are you using?
main
What is the issue?
This appears to be an issue in balancerd. Here is a query that failed: https://materializeinc.slack.com/archives/C05CQUVHY48/p1721818953826099
Buildkite link: https://buildkite.com/materialize/test/builds/86589#0190e435-3357-4940-b8e2-560821b1acca
But the query has actually successfully run in Materialize: https://console.materialize.com/regions/aws-us-east-1/query-history/84977a46-e4e0-4f5f-8982-3acd2f6d2685 via the Query History: We are getting this error all the time in CI, check out the #team-testing-bots Slack channel for a list. From QA side we should scan our old runs to see when this started happening. We only noticed now because we only just started posting such failures in a Slack channel: https://github.com/MaterializeInc/materialize/pull/28445