Open dmmartins opened 1 month ago
Hi @dmmartins, this is not a bug. DuckDB inlines your CTE definition, resulting in a query that looks like this:
select "table1"."id", "table1"."name", "table2"."table1_id", "table2"."name"
from (select uuid() as "id", 'test' as "name") AS "table1"
join (select
"table1"."id" as "table1_id",
'test' as "name"
from (select uuid() as "id", 'test' as "name") AS "table1") AS "table2" on "table2"."name" = "table1"."name";
This query contains multiple different calls to uuid()
, hence your UUID
s are not the same.
What you likely want to do, is to use MATERIALIZED
CTEs:
with "table1" as materialized (
select uuid() as "id",
'test' as "name"
), "table2" as materialized (
select
"table1"."id" as "table1_id",
'test' as "name"
from "table1"
)
select "table1"."id", "table1"."name", "table2"."table1_id", "table2"."name"
from "table1"
join "table2" on "table2"."name" = "table1"."name";
This should implement the logic you are trying to express.
@kryonix does it make sense to perhaps automatically materialize the CTE if a function with FunctionStability::VOLATILE
is found in the body?
Or is that sometimes not desired you think?
I think such a heuristic would make a lot of sense! Having a MATERIALIZED
CTE is clearly the intention here—and maybe in general when using volatile functions. With PR #12290, this should not be hard to implement at all.
Also, if a user strictly wants to have the CTE not materialized (aka. have the behavior of the original query here), this would still be possible (using NOT MATERIALIZED
). This would also fix #9940 (assuming random
is marked as VOLATILE
).
This would also fix #9940 (assuming
random
is marked asVOLATILE
).
It is 👍
We talked a little about it, but Laurens's automatic materialization acts before binding, and that information is not available before binding the body, so I don't think this is feasible currently.
Well.. That's unfortunate. I was hoping that we could just piggyback on @lnkuiper's PR 😅 That would have been awesome, and easy. I still think that such a heuristic should exist in the system 🤔 (Again, repeating myself here, but ideally having just a single place where CTEs get automatically materialized).
With @lnkuiper we at some point discussed going even further: forbidding CTE with no explicit materialization strategy IF any invoked function is VOLATILE
.
Rationale would be that materialization strategy does matter, and user needs to be explicit. So either MATERIALIZED
or NOT MATERIALIZED
would need to be provided.
Rationale would be that materialization strategy does matter, and user needs to be explicit. So either MATERIALIZED or NOT MATERIALIZED would need to be provided.
I could see that work. This would be a perfect place to throw a warning—if that would be something databases did 😄
I did think about this some more. @lnkuiper's PR introduces CTE materialization if certain conditions are met (before binding completes). But maybe this could be flipped? My thinking is, what if all CTEs would be handled as if they were materialized at first. And only after binding we remove and inline exactly those CTEs that either (1) explicitly request to not be materialized, or (2) satisfy some arbitrary criteria (e.g. do not contain VOLATILE functions, etc.). I'm not entirely sure if (and how well) this fits into the current way things are handled (i.e. is it possible to inline the CTEs after the fact without breaking a bunch of things in later binding/logical/physical planning), but maybe that could be a way to enable us to express much more intricate CTE inlining, or materialization strategies.
What happens?
When I have a CTE that sets
uuid() as "id"
, the value of the column"id"
changes every time I select from that tableTo Reproduce
This query
return different
"id"
and"table1_id"
OS:
macOS 13.6.7 (22G720) aarch64
DuckDB Version:
v1.0.0 1f98600c2c
DuckDB Client:
cli
Full Name:
Diego Manenti Martins
Affiliation:
Stem
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?