MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com

compute: differently sized replicas can return different errors for the same queries #27223

Open · teskje opened this issue 3 months ago

teskje commented 3 months ago

What version of Materialize are you using?

main

What is the issue?

A general assumption of the compute protocol is that all replicas receive the (semantically) same stream of commands and return the (semantically) same stream of responses. This is currently not the case for dataflow errors returned in PeekResponses, SubscribeResponses, and CopyToResponses.
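To make that invariant concrete, here is a tiny hypothetical Rust sketch (the type and function below are illustrative stand-ins, not the actual compute protocol types): the controller should be able to treat any replica's response to the same peek as interchangeable, which is exactly what breaks here for error responses.

// Illustrative stand-in for a peek response: either result rows or an error.
#[derive(Debug, PartialEq)]
enum PeekResult {
    Rows(Vec<String>),
    Error(String),
}

// The protocol assumption: all replicas answer the same peek with semantically
// equal responses, so the controller may use whichever arrives first.
fn replicas_agree(responses: &[PeekResult]) -> bool {
    responses.windows(2).all(|w| w[0] == w[1])
}

fn main() {
    // What this issue demonstrates: differently sized replicas can return
    // different (here: made-up) errors, violating the assumption.
    let from_r1 = PeekResult::Error("error A".into());
    let from_r2 = PeekResult::Error("error B".into());
    assert!(!replicas_agree(&[from_r1, from_r2]));
}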

Reproduction:

CREATE CLUSTER test REPLICAS (
    r1 (SIZE '1'),
    r2 (SIZE '2'),
    r3 (SIZE '2-2'),
    r4 (SIZE '4-4')
);
SET cluster = test;

CREATE TABLE large (x int);
INSERT INTO large SELECT generate_series(999999999, 1000000100);

SET cluster_replica = r1;
SELECT chr(x) FROM large;
SET cluster_replica = r2;
SELECT chr(x) FROM large;
SET cluster_replica = r3;
SELECT chr(x) FROM large;
SET cluster_replica = r4;
SELECT chr(x) FROM large;

The issue is caused by our current policy of returning an arbitrary error in a peek response when the peeked collection contains multiple errors. Each replica picks its error independently, and nothing ensures that they all pick the same one. When all replicas have the same worker count, compute's determinism ensures (or should ensure) that they happen to agree, but that is not the case when replicas have different sizes. The sketch below illustrates how the picked error can depend on the worker count.
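For illustration, here is a minimal, self-contained Rust sketch (not Materialize code; the hash-based partitioning and the function names are assumptions made for this example) of why a "pick any error" policy depends on the worker count: errors get distributed across workers, and the error a replica reports is simply the first one it encounters.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Assign each error to a worker by hashing it, loosely mimicking how data is
// distributed across timely workers (hypothetical partitioning scheme).
fn partition(errors: &[&str], workers: usize) -> Vec<Vec<String>> {
    let mut shards = vec![Vec::new(); workers];
    for e in errors {
        let mut h = DefaultHasher::new();
        e.hash(&mut h);
        shards[(h.finish() as usize) % workers].push(e.to_string());
    }
    shards
}

// "Pick any" policy: report the first error held by the lowest-indexed
// non-empty worker. Which error that is depends on the worker count.
fn pick_any(errors: &[&str], workers: usize) -> Option<String> {
    partition(errors, workers).into_iter().flatten().next()
}

fn main() {
    let errors = ["error A", "error B", "error C"];
    // Replicas with different worker counts can report different errors
    // for the same underlying data.
    println!("1 worker:  {:?}", pick_any(&errors, 1));
    println!("4 workers: {:?}", pick_any(&errors, 4));
}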

It's not clear whether this non-determinism causes any negative consequences or even correctness issues, but it seems like a good idea to fix it in any case.

teskje commented 3 months ago

I see two general ways to fix this:

  1. Return all errors, not just a single one. Let compute clients decide how to further filter them down.
    • Has the potential to remove some complexity from the compute protocol as it makes error handling more similar to row handling.
    • For collections with a lot of errors, this has the unfortunate side effect of blowing up the size of responses for queries reading from those collections (instead of one error string they might contain a million). There is the risk that users see "max result size exceeded" errors instead of learning about the actual errors in their data.
  2. Pick errors deterministically instead of randomly. For example, the compute protocol could define that the smallest error, according to the lexicographic order on strings, must be returned.
    • Requires changes at least in the replica code that handles peeks and in PartitionedComputeState, both of which need to collect the candidate errors and select the minimal one (see the sketch after this list).
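A minimal sketch of what option 2's deterministic selection could look like, assuming the candidate errors are available as plain strings wherever a single one has to be chosen (the function below is illustrative, not the actual peek-handling or PartitionedComputeState code):

// Deterministically reduce a set of dataflow errors to one by taking the
// lexicographically smallest error string. Every replica, regardless of its
// worker count, produces the same result for the same set of errors.
fn pick_deterministic<I>(errors: I) -> Option<String>
where
    I: IntoIterator<Item = String>,
{
    errors.into_iter().min()
}

fn main() {
    // The same errors, observed in different orders on differently sized
    // replicas, still reduce to the same reported error.
    let on_r1 = vec!["error B".to_string(), "error A".to_string()];
    let on_r4 = vec!["error A".to_string(), "error B".to_string()];
    assert_eq!(pick_deterministic(on_r1), pick_deterministic(on_r4));
}

Applying the same reduction both on the replica side and in the controller's PartitionedComputeState would keep the chosen error consistent across replica sizes.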
benesch commented 3 months ago

Pick errors deterministically instead of randomly. For example, the compute protocol could define that the smallest error, according to the lexicographic order on strings, must be returned.

This seems like the way to go to me!