duckdb / duckdb

DuckDB is an analytical in-process SQL database management system
http://www.duckdb.org
MIT License
21.47k stars 1.73k forks source link

CTE column value changes every time the table is used #12875

Open dmmartins opened 1 month ago

dmmartins commented 1 month ago

What happens?

When I have a CTE that sets uuid() as "id", the value of the column "id" changes every time I select from that table

To Reproduce

This query

with "table1" as (
    select uuid() as "id",
    'test' as "name"
), "table2" as (
    select
        "table1"."id" as "table1_id",
        'test' as "name"
    from "table1"
)
select "table1"."id", "table1"."name", "table2"."table1_id", "table2"."name"
from "table1"
join "table2" on "table2"."name" = "table1"."name";

return different "id" and "table1_id"

OS:

macOS 13.6.7 (22G720) aarch64

DuckDB Version:

v1.0.0 1f98600c2c

DuckDB Client:

cli

Full Name:

Diego Manenti Martins

Affiliation:

Stem

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

kryonix commented 4 weeks ago

Hi @dmmartins, this is not a bug. DuckDB inlines your CTE definition, resulting in a query that looks like this:

select "table1"."id", "table1"."name", "table2"."table1_id", "table2"."name"
from (select uuid() as "id", 'test' as "name") AS "table1"
join (select
        "table1"."id" as "table1_id",
        'test' as "name"
    from (select uuid() as "id", 'test' as "name") AS "table1") AS "table2" on "table2"."name" = "table1"."name";

This query contains multiple different calls to uuid(), hence your UUIDs are not the same.

What you likely want to do, is to use MATERIALIZED CTEs:

with "table1" as materialized (
    select uuid() as "id",
    'test' as "name"
), "table2" as materialized (
    select
        "table1"."id" as "table1_id",
        'test' as "name"
    from "table1"
)
select "table1"."id", "table1"."name", "table2"."table1_id", "table2"."name"
from "table1"
join "table2" on "table2"."name" = "table1"."name";

This should implement the logic you are trying to express.

Tishj commented 4 weeks ago

@kryonix does it make sense to perhaps automatically materialize the CTE if a function with FunctionStability::VOLATILE is found in the body? Or is that sometimes not desired you think?

kryonix commented 4 weeks ago

I think such a heuristic would make a lot of sense! Having a MATERIALIZED CTE is clearly the intention here—and maybe in general when using volatile functions. With PR #12290, this should not be hard to implement at all.

Also, if a user strictly wants to have the CTE not materialized (aka. have the behavior of the original query here), this would still be possible (using NOT MATERIALIZED). This would also fix #9940 (assuming random is marked as VOLATILE).

Tishj commented 4 weeks ago

This would also fix #9940 (assuming random is marked as VOLATILE).

It is 👍

Tishj commented 3 weeks ago

We talked a little about it, but Laurens's automatic materialization acts before binding, and that information is not available before binding the body, so I don't think this is feasible currently.

kryonix commented 3 weeks ago

Well.. That's unfortunate. I was hoping that we could just piggyback on @lnkuiper's PR 😅 That would have been awesome, and easy. I still think that such a heuristic should exist in the system 🤔 (Again, repeating myself here, but ideally having just a single place where CTEs get automatically materialized).

carlopi commented 3 weeks ago

With @lnkuiper we at some point discussed going even further: forbidding CTE with no explicit materialization strategy IF any invoked function is VOLATILE.

Rationale would be that materialization strategy does matter, and user needs to be explicit. So either MATERIALIZED or NOT MATERIALIZED would need to be provided.

kryonix commented 3 weeks ago

Rationale would be that materialization strategy does matter, and user needs to be explicit. So either MATERIALIZED or NOT MATERIALIZED would need to be provided.

I could see that work. This would be a perfect place to throw a warning—if that would be something databases did 😄


I did think about this some more. @lnkuiper's PR introduces CTE materialization if certain conditions are met (before binding completes). But maybe this could be flipped? My thinking is, what if all CTEs would be handled as if they were materialized at first. And only after binding we remove and inline exactly those CTEs that either (1) explicitly request to not be materialized, or (2) satisfy some arbitrary criteria (e.g. do not contain VOLATILE functions, etc.). I'm not entirely sure if (and how well) this fits into the current way things are handled (i.e. is it possible to inline the CTEs after the fact without breaking a bunch of things in later binding/logical/physical planning), but maybe that could be a way to enable us to express much more intricate CTE inlining, or materialization strategies.