cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

sql: possibly incorrectly optimizing away one of the projections #124695

Open AllyMarthaJ opened 4 months ago

AllyMarthaJ commented 4 months ago

Describe the problem

I have a table with two relevant columns: one with an enum type and one with the BYTES type.

I'm doing a conditional update on multiple rows of this table using CASE ... WHEN, and some of the values being assigned are NULL.

Unfortunately, when the update sets both the enum column and the bytes column to NULL, the NULL intended for the bytes column is typed as the enum, so a type-mismatch error is produced.

e.g. ERROR: value type debug_t doesn't match type bytes of column "bytes_col"

To Reproduce

Set up a CockroachDB cluster however you like, then run the following statements to create the type and table:

CREATE TYPE IF NOT EXISTS debug_t AS ENUM('test1', 'test2');

CREATE TABLE debug (
    id INT NOT NULL,
    enum_col debug_t DEFAULT NULL,
    bytes_col BYTES DEFAULT NULL,
    PRIMARY KEY(id)
);

INSERT INTO debug ("id", "enum_col", "bytes_col") VALUES (1, 'test1', 'some_bytes_yo');

Then observe that this query does work:

UPDATE debug SET "enum_col"=NULL, "bytes_col"=NULL WHERE "id" IN (1);

But this query, despite being valid syntax, does not work and raises ERROR: value type debug_t doesn't match type bytes of column "bytes_col":

UPDATE debug SET
    "enum_col" = (CASE "id"
                    WHEN 1 THEN
                        NULL
                    ELSE
                        NULL
                    END),
    "bytes_col" = (CASE "id"
                    WHEN 1 THEN
                        NULL
                    ELSE
                        NULL
                    END)
WHERE "id" IN (1);

If the first NULL is replaced with any valid enum value, the query works. Likewise, replacing the third NULL (the first one in the bytes_col CASE) with CAST(NULL AS BYTES) also works.
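
The second workaround can be written out as follows. This is a sketch against the debug table from the steps above; only the first NULL in the bytes_col CASE is cast, which per the observation above is enough for the statement to be accepted.

```sql
-- Sketch: the same UPDATE as the failing one, except that the third NULL
-- (the WHEN branch of the bytes_col CASE) is given an explicit type.
UPDATE debug SET
    "enum_col" = (CASE "id"
                    WHEN 1 THEN
                        NULL
                    ELSE
                        NULL
                    END),
    "bytes_col" = (CASE "id"
                    WHEN 1 THEN
                        CAST(NULL AS BYTES)
                    ELSE
                        NULL
                    END)
WHERE "id" IN (1);
```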

Expected behavior

See above.

Environment: CockroachDB v23.1.14

Additional context

This seems to impact PostgreSQL as well.

Jira issue: CRDB-39015

blathers-crl[bot] commented 4 months ago

Hi @AllyMarthaJ, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

yuzefovich commented 4 months ago

Thanks for the report!

IIUC this is the expected behavior, and the error contains a hint on how to work around it:

ERROR: value type debug_t doesn't match type bytes of column "bytes_col"
SQLSTATE: 42804
HINT: you will need to rewrite or cast the expression

I think what happens is that we have two untyped expressions, both evaluating to NULL; we type-check the first one (which gets the debug_t type) and automatically apply the same type to the second expression.

If we add an explicit cast to either of the expressions, then it works:

UPDATE
    debug
SET
    enum_col = (CASE id WHEN 1 THEN NULL ELSE NULL END)::debug_t,
    bytes_col = (CASE id WHEN 1 THEN NULL ELSE NULL END)
WHERE
    id IN (1);
-- or
UPDATE
    debug
SET
    enum_col = (CASE id WHEN 1 THEN NULL ELSE NULL END),
    bytes_col = (CASE id WHEN 1 THEN NULL ELSE NULL END)::BYTES
WHERE
    id IN (1);

I'll close this as behaving as expected -- cc @cockroachdb/sql-queries in case my understanding of type checking is incorrect.

AllyMarthaJ commented 4 months ago

Hmm, that makes sense as an explanation, but I fail to understand why applying the same type to the second expression is expected behaviour! Especially since the cast needn't be applied to the entire expression: casting just one of the values inside a WHEN is sufficient.

Ideally I wouldn't have to cast either expression, but maybe my understanding of types in the context of postgres/crdb is wrong.

Thanks for taking a look regardless 🙂

yuzefovich commented 4 months ago

It seems that postgres requires an explicit cast for both:

yuzefovich=# CREATE TYPE debug_t AS ENUM('test1', 'test2');
CREATE TYPE
yuzefovich=# CREATE TABLE debug (                          
    id INT NOT NULL,
    enum_col debug_t DEFAULT NULL,
    bytes_col TEXT DEFAULT NULL,
    PRIMARY KEY(id)
);

INSERT INTO debug ("id", "enum_col", "bytes_col") VALUES (1, 'test1', 'some_bytes_yo');
CREATE TABLE
INSERT 0 1
yuzefovich=# UPDATE
        debug
SET
        enum_col = (CASE id WHEN 1 THEN NULL ELSE NULL END),
        bytes_col = (CASE id WHEN 1 THEN NULL ELSE NULL END)       
WHERE
        id IN (1);
ERROR:  column "enum_col" is of type debug_t but expression is of type text
LINE 4:  enum_col = (CASE id WHEN 1 THEN NULL ELSE NULL END),
                     ^
HINT:  You will need to rewrite or cast the expression.

However, I took a closer look at what CRDB does, and I'm a bit concerned about what I'm seeing. In particular, it seems that in your example we actually optimize away the second CASE expression (because the two CASE expressions are exactly the same), so we try to use the single CASE expression, typed as debug_t, for both columns, and that fails. Adding an explicit cast to either expression makes them distinct, so both CASE expressions end up being used.

I'm wondering whether, depending on the context, we should avoid applying the optimization rule that removes one of the seemingly redundant expressions. I'll check with my colleagues.
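
One possible way to inspect this, offered as a rough sketch rather than a confirmed diagnosis: CockroachDB's EXPLAIN (OPT, TYPES) prints the optimizer's plan with type annotations, so running it on a variant that succeeds (here, the one with the ::BYTES cast) may show whether each column gets its own projection. The exact plan output depends on the version, so none is shown here.

```sql
-- Assumption: run against the debug table from the reproduction steps.
-- EXPLAIN does not execute the UPDATE; it only prints the plan.
EXPLAIN (OPT, TYPES)
UPDATE debug SET
    enum_col = (CASE id WHEN 1 THEN NULL ELSE NULL END),
    bytes_col = (CASE id WHEN 1 THEN NULL ELSE NULL END)::BYTES
WHERE id IN (1);
```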

yuzefovich commented 4 months ago

It looks like this behavior might not be easy to change, and it has existed for a long time. Also, since this behavior matches postgres, we're going to treat this as an "enhancement", but it's unlikely we'll get to it soon.