MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com
Other
5.72k stars 466 forks source link

Gracefully handle invalid function calls #6367

Open benesch opened 3 years ago

benesch commented 3 years ago

Consider a source in Materialize that has a text column that is sometimes JSON and sometimes not. If the user creates a view that parses this JSON

CREATE SOURCE src FROM ... FORMAT BYTES
CREATE MATERIALIZED VIEW v AS SELECT text::json FROM src

queries on that view like SELECT * FROM v will fail with "invalid json", without telling you which record had invalid JSON, or allowing you to ignore that one record.

One proposal, via @justinj, is to import Cockroach's IFERROR function (https://github.com/cockroachdb/cockroach/pull/25304), which makes it possible to inspect the errors from SQL.

ggevay commented 1 year ago

Requires also a bit of optimizer thinking as noted here, ~but should be possible to implement~. Actually, this seems quite hard, see below.

ggevay commented 1 year ago

We might want to ban subqueries, as explained in the slack thread.

Additionally, we'll have to make sure that no optimization will move the erroring part of the scalar expression from inside the iferror into a different operator. This is because if the erroring part of the scalar expression ends up in a different operator, then the scalar error would be converted to a dataflow error before reaching iferror, and iferror can only handle scalar errors.

For example, MapFilterProject.optimize can change an MFP quite a bit; certainly enough that the erroring part of the argument expression of an IFERROR call gets into a different scalar expression inside the same MFP. This is not a problem in itself, but might make it easier for other optimizations to move that expression around. And then there are various other optimizations which move scalar expressions around, e.g., inlining them into expressions in later operators or pushing Filter predicates downward. Also, some parts of some MFPs get absorbed into neighboring operators during lowering to LIR, e.g., an MFP on top of a Join can be broken down into several MFPs, which are applied at various stages of the join.

Related to the above, I'm not sure we can handle errors coming from aggregations: If we have IFERROR(SUM(x/0)), then the x/0 will be executed in a different operator from the IFERROR, so we have the same problem as above (but without even any optimizations).

ggevay commented 1 year ago

I have an idea for solving the "optimizer moving things around problem" from above. It's a variation of this idea. We could add a new field called dont_error to the CallUnary, CallBinary, CallVariadic MirScalarExpr variants, which would cause the eval function to catch an EvalError and convert it into a special Datum variant called something like error_that_will_be_caught_by_iferror. This would have similar semantics to null, but the standard null-handling functions (e.g., IS NULL) don't recognize it, so it can just freely propagate through expression evaluations until it is caught. Then, we could add a special pass somewhere in HIR that would set this field inside all function calls inside an IFERROR argument expression. This would mean that no matter where the various subexpressions of the argument expression of IFERROR end up, they wouldn't produce errors, but just this special datum, which would eventually reach the IFERROR, which would handle it appropriately. (Note that we can't simply use null instead of this special datum, because the argument expression of IFERROR might have an IS NULL or other null handling function somewhere, which would spuriously catch it before the null would reach IFERROR.)

This is not solving the problem with the Filter error("more than one record produced in subquery") pattern, so we would have the same limitation as Cockroach has, but it might be solving all the other problems from above. (Aggregation functions would need to handle this special datum.)

benesch commented 1 year ago

Thanks for thinking about this and writing these thoughts up, @ggevay!

Recording some thoughts from a Slack discussion.

The complexity you've uncovered makes me think there are dragons here that we might prefer to avoid.

As an alternative to a general iferror function, we could instead introduce try_ variants of each function that return NULL instead of erroring. For example, try_to_timestamp would be the error-free variant of to_timestamp; try_cast would be the error-free variant of cast.

The try_ approach is appealing because it is not a general-purpose exception handling mechanism. You don't need to worry about handling an error that occurred deep inside an expression (e.g., iferror(fn1(fn2(error_producing_function()))))—the try-able semantics apply only to specific function.

Open question: is try_ flexible enough? The described semantics mean that the try_ variant of each function replaces errors with NULL. But users might need to replace only some errors with NULL, or distinguish between the function returning NULL without erroring, and the function erroring. try_ won't enable any of these use cases.

ggevay commented 1 year ago

The try_ variant approach sounds good to me.

Open question: is try flexible enough? The described semantics mean that the try variant of each function replaces errors with NULL. But users might need to replace only some errors with NULL, or distinguish between the function returning NULL without erroring, and the function erroring. try_ won't enable any of these use cases.

We could do something like Cockroach's IFERROR with some extra arguments:

This would make the implementation a slightly bit more complex, because the arity of these functions would be different from the originals, but I think it's not too bad.

benesch commented 1 year ago

Yeah, I was thinking along those lines, but I'm worried that it'll read poorly. E.g. consider make_timestamp and a hypothetical try_make_timestamp:

SELECT make_timestamp(2020, 10, 25, 9, 30, 17.12);
SELECT try_make_timestamp(2020, 10, 25, 9, 30, 17.12, NULL);

It's already unreadable soup with the 6 arguments. The NULL doesn't obviously read as "use this on error."

We might also want to consider syntax like:

SELECT make_timestamp(2020, 10, 25, 9, 30, 17.12 IF ERROR WHEN <code> THEN <expr> WHEN <code> THEN <expr> ELSE <expr>);

Idk, that's ugly too, but maybe reads a bit better and looks like the existing CASE syntax and a little bit like CAST(... AS ...) and EXTRACT( ... FROM ) where SQL bends the usual rules for function call parameters.

Personal take: I think the proposal where try_ very simply replaces all errors with NULLs is useful enough to be worth the simple and clear syntax, even though it is limited in functionality. If we find users asking to extend the semantics of try_ functions (e.g., changing the value on error, or only handling errors with certain codes), at that point I think we should consider dedicated syntax.

mgree commented 2 months ago

A general principle in PL design is that for any partial operation, there should exist either (a) an easy way to predict whether or not it will succeed or (b) a way to handle its failure. TRYCAST is (b), but I can imagine doing (a) instead: go through each *Func in ScalarExpr that can fail and adding metadata to say what its "success predicate" is---and then explicitly mention these in the docs.

E.g., the success predicate for division on integer types is that the second number is not 0; the success predicate for square root is that its input is a non-negative number; the success predicate for casting a number to text is matching some fixed regexp.

benesch commented 2 months ago

E.g., the success predicate for division on integer types is that the second number is not 0; the success predicate for square root is that its input is a non-negative number; the success predicate for casting a number to text is matching some fixed regexp.

How would you describe the success predicate for something like casting timestamptz? The supported formats are much more complex than what a regex can describe.

mgree commented 2 months ago

That is a messy one! I think most languages would revert to an error-handling solution. If for some reason we wanted to not bother with a full TRYCAST, you could write a custom TRYCAST for such individual cases. Probably easier to just do TRYCAST.