MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com

Test Analytics CI queries keep failing with "network error" but succeeded in Materialize #28472

Open def- opened 1 month ago

def- commented 1 month ago

What version of Materialize are you using?

main

What is the issue?

This appears to be an issue in balancerd. Here is a query that failed: https://materializeinc.slack.com/archives/C05CQUVHY48/p1721818953826099

debezium-postgres: Uploading results failed! Failed to write to test analytics database! Cause: network error (last executed sql:
INSERT INTO build_job
(
   build_job_id,
   build_step_id,
   build_id,
   build_step_key,
   shard_index,
   retry_count,
   insert_date,
   is_latest_retry,
   success,
   aws_instance_type,
   remarks
)
SELECT
 '0190e435-3357-4940-b8e2-560821b1acca',
 '0190e435-3141-4229-9628-e1b0ca8e2d62',
 '0190e433-bdad-4495-9c3a-e7b35c130855',
 'debezium-postgres',
 NULL::INT,
 0,
 now(),
 TRUE,
 True,
 'c7g.xlarge',
 NULL
WHERE NOT EXISTS
(
   SELECT 1
   FROM build_job
   WHERE build_job_id = '0190e435-3357-4940-b8e2-560821b1acca'
);
)

Buildkite link: https://buildkite.com/materialize/test/builds/86589#0190e435-3357-4940-b8e2-560821b1acca

2024-07-24 10:48:56 UTC Updates to test analytics database
2024-07-24 10:48:56 UTC > INSERT INTO build ( pipeline, build_number, build_id, branch, commit_hash, main_ancestor_commit_hash, mz_version, date, data_version, remarks ) SELECT 'test', 86589, '0190e433-bdad-4495-9c3a-e7b35c130855', 'aljoscha:storage-read-only-sinks', '6772496f8b3c7d05b2a23184d434582513590288', 'aedc841c52ffcff4a910bf04ab67242edb23ae23', 'v0.110.0-dev', now(), 10, NULL WHERE NOT EXISTS ( SELECT 1 FROM build WHERE build_id = '0190e433-bdad-4495-9c3a-e7b35c130855' );
2024-07-24 10:48:59 UTC -- OK (0 rows affected, 2.54s)
2024-07-24 10:48:59 UTC > INSERT INTO build_job ( build_job_id, build_step_id, build_id, build_step_key, shard_index, retry_count, insert_date, is_latest_retry, success, aws_instance_type, remarks ) SELECT '0190e435-3357-4940-b8e2-560821b1acca', '0190e435-3141-4229-9628-e1b0ca8e2d62', '0190e433-bdad-4495-9c3a-e7b35c130855', 'debezium-postgres', NULL::INT, 0, now(), TRUE, True, 'c7g.xlarge', NULL WHERE NOT EXISTS ( SELECT 1 FROM build_job WHERE build_job_id = '0190e435-3357-4940-b8e2-560821b1acca' );
2024-07-24 10:49:04 UTC -- FAILED!
2024-07-24 10:49:04 UTC Upload failed.

But the query has actually run successfully in Materialize, as shown in the Query History: https://console.materialize.com/regions/aws-us-east-1/query-history/84977a46-e4e0-4f5f-8982-3acd2f6d2685

[Screenshot 2024-07-24 at 14:11:08: the Query History entry for the statement, showing it as successful]

We are getting this error all the time in CI; see the #team-testing-bots Slack channel for a list. On the QA side we should scan our old runs to find out when this started happening. We only noticed it now because we just started posting such failures to a Slack channel: https://github.com/MaterializeInc/materialize/pull/28445

maddyblue commented 1 month ago

What's the evidence that this is a balancerd problem? The network error?

Although the statement history reports that the statement succeeded, that does not mean the transaction succeeded. The listed query is an INSERT ... SELECT, which is a ReadThenWrite plan: statement success only means that the rows to insert have been stashed in the transaction. Committing the transaction is what persists the new data to durable storage. The failure could occur during commit time. For example, I see this error in that Slack channel:

scalability-benchmark-connection: Uploading results failed! Failed to write to test analytics database! Cause: {'S': 'ERROR', 'C': '25001', 'M': ".. cannot be run inside a transaction block"}

The error here is different from the "network error" in this issue, so these could be separate problems.
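
To make the statement-vs-commit distinction concrete, here is a minimal sketch of an autocommit upload from a Python client. It is illustrative only: the actual CI uploader and its driver are not shown in this issue, so psycopg2, the TEST_ANALYTICS_DSN environment variable, and the INSERT_BUILD_JOB_SQL constant are all assumptions, with the constant standing in for the INSERT INTO build_job ... SELECT statement quoted above.

import os
import psycopg2

INSERT_BUILD_JOB_SQL = "..."  # placeholder for the INSERT INTO build_job ... SELECT above

conn = psycopg2.connect(os.environ["TEST_ANALYTICS_DSN"])  # hypothetical connection string
conn.autocommit = True  # each execute() runs as its own implicit transaction

with conn.cursor() as cur:
    # For a ReadThenWrite plan like INSERT ... SELECT, statement success only
    # means the rows were stashed in the transaction; the implicit commit that
    # follows is what makes the write durable. A network error surfacing from
    # this one call therefore does not tell the client whether that commit
    # happened, even though the statement itself can appear as successful in
    # the query history.
    cur.execute(INSERT_BUILD_JOB_SQL)

From the client's point of view the statement and its commit happen in one call here, which lines up with the symptom above: the Query History records the statement as successful while the uploader only sees a network error.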

def- commented 1 month ago

The "cannot be run inside a transaction block" is unrelated, I broke that and fixed it already: https://github.com/MaterializeInc/materialize/pull/28525 It was not happening at the time of the original "network error" failures.

The failure could occur during commit time.

Is there a way to see that? I guess we need a transaction history. ;) I tried searching for the query ID in Grafana, but couldn't find anything being logged. The logs around that time also don't look unusual:

[Screenshot 2024-07-27 at 00:43:25: logs around the time of the failure]

All of these queries are being run with autocommit=True btw, in case it matters.

maddyblue commented 1 month ago

I don't think there's a way to see this besides looking in the client logs. autocommit=True is indeed what I would expect to cause this (the broken thing was implicit transaction commit, which is what autocommit does). Mostly I don't actually think it's that problem (which is fixed by https://github.com/MaterializeInc/materialize/pull/28549); it's just on my mind, so I'm wondering whether that PR fixes more things than we expect, because it was silently doing bad things for a long time (in rare cases).

def- commented 1 month ago

So if this were a transaction problem, we could get more details by removing the autocommit and checking the result of each transaction? I can do that.
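
As a rough sketch of that idea, again with a psycopg2-style client and the same hypothetical names (not the real CI code): with autocommit turned off, execute() and commit() become separate calls, so the upload log can say which of the two failed. One caveat to check first is that the driver then wraps the statement in an explicit transaction, so this only helps if Materialize accepts the INSERT ... SELECT inside a transaction block.

import os
import psycopg2

INSERT_BUILD_JOB_SQL = "..."  # placeholder for the INSERT INTO build_job ... SELECT above

conn = psycopg2.connect(os.environ["TEST_ANALYTICS_DSN"])
conn.autocommit = False  # the driver now opens an explicit transaction for the statement

try:
    with conn.cursor() as cur:
        cur.execute(INSERT_BUILD_JOB_SQL)
except psycopg2.Error as e:
    conn.rollback()
    print(f"statement failed: {e}")
else:
    try:
        conn.commit()  # durability happens here; a failure at this point means the row was not written
    except psycopg2.Error as e:
        print(f"commit failed, write is not durable: {e}")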

maddyblue commented 1 month ago

Again, I'm doubtful that was a correct guess; it was just a recent thing I looked at. I think the most pertinent bug now is

cannot be run inside a transaction block

which means the INSERT is being executed incorrectly (wrapped in BEGIN/COMMIT). I recommend fixing that and then looking at the new incoming errors.
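
For what it's worth, the test code does not need a literal BEGIN to end up in this state: DB-API drivers such as psycopg2 default to non-autocommit mode and send BEGIN before the first statement on a connection, and SQLSTATE 25001 is what the server reports when a statement it only allows outside a transaction block arrives that way. A hedged illustration, reusing the hypothetical names from the sketches above:

import os
import psycopg2

INSERT_BUILD_JOB_SQL = "..."  # placeholder for the INSERT INTO build_job ... SELECT above

conn = psycopg2.connect(os.environ["TEST_ANALYTICS_DSN"])
# conn.autocommit is False by default, so psycopg2 sends BEGIN before the
# execute below and the server sees the statement inside an explicit
# transaction block, even though the test never wrote BEGIN itself.
with conn.cursor() as cur:
    try:
        cur.execute(INSERT_BUILD_JOB_SQL)
    except psycopg2.Error as e:
        print(e.pgcode, e)  # pgcode 25001 here would confirm the BEGIN/COMMIT wrapping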

def- commented 1 month ago

This has already been fixed in https://github.com/MaterializeInc/materialize/pull/28525; it's just still occurring in older open PRs that don't have my change yet.

jkosh44 commented 1 month ago

What Materialize org is used to execute these queries?

def- commented 1 month ago

Materialize Production Analytics.

jkosh44 commented 1 month ago

> Materialize Production Analytics.

What database and schema is build_job in?

nrainer-materialize commented 1 month ago

> > Materialize Production Analytics.
>
> What database and schema is build_job in?

raw.test_analytics

jkosh44 commented 1 month ago

FWIW the inner SELECT query executes fairly fast on the production analytics env, even without the index.

materialize=> EXPLAIN SELECT 1
   FROM raw.test_analytics.build_job
   WHERE build_job_id = '0190e435-3357-4940-b8e2-560821b1acca';
                                     Optimized Plan                                      
-----------------------------------------------------------------------------------------
 Explained Query:                                                                       +
   Project (#13) // { arity: 1 }                                                        +
     Filter (#0{build_job_id} = "0190e435-3357-4940-b8e2-560821b1acca") // { arity: 14 }+
       Map (1) // { arity: 14 }                                                         +
         ReadStorage raw.test_analytics.build_job // { arity: 13 }                      +
                                                                                        +
 Source raw.test_analytics.build_job                                                    +
   filter=((#0{build_job_id} = "0190e435-3357-4940-b8e2-560821b1acca"))                 +
   pushdown=((#0{build_job_id} = "0190e435-3357-4940-b8e2-560821b1acca"))               +
                                                                                        +
 Target cluster: materialize                                                            +

(1 row)

Time: 20.065 ms
materialize=> SELECT 1
   FROM raw.test_analytics.build_job
   WHERE build_job_id = '0190e435-3357-4940-b8e2-560821b1acca';
 ?column? 
----------
        1
(1 row)

A raw SELECT is fast too

materialize=> EXPLAIN SELECT 1
   FROM raw.test_analytics.build_job;
                         Optimized Plan                          
-----------------------------------------------------------------
 Explained Query:                                               +
   Project (#13) // { arity: 1 }                                +
     Map (1) // { arity: 14 }                                   +
       ReadStorage raw.test_analytics.build_job // { arity: 13 }+
                                                                +
 Source raw.test_analytics.build_job                            +
                                                                +
 Target cluster: materialize                                    +

(1 row)

Time: 17.553 ms
materialize=> SELECT 1
   FROM raw.test_analytics.build_job;
Time: 1491.239 ms (00:01.491)
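
The index referred to above is not shown in this thread, but the EXPLAIN output (ReadStorage with a pushdown filter, target cluster materialize) suggests the point lookup is currently answered by scanning the source. A hypothetical index on the filtered column, created here through the same kind of Python client as in the sketches above (the index name is an assumption), would let such lookups be served from an indexed arrangement instead:

import os
import psycopg2

conn = psycopg2.connect(os.environ["TEST_ANALYTICS_DSN"])
conn.autocommit = True  # DDL like CREATE INDEX runs outside a transaction block

with conn.cursor() as cur:
    # Hypothetical index covering the WHERE build_job_id = '...' lookup.
    cur.execute(
        "CREATE INDEX build_job_build_job_id_idx "
        "IN CLUSTER materialize "
        "ON raw.test_analytics.build_job (build_job_id)"
    )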

def- commented 1 month ago

It might get slower when someone creates a PR that fails in every test.