Closed korenmiklos closed 1 month ago
Thanks @korenmiklos. If you omit the @query
from your code, this code should just work, where you can continue to build on the query. Is this sufficient?
cte = @chain db_table(db, :mtcars) begin
@filter(!starts_with(model, "M"))
@group_by(cyl)
@summarize(mpg = mean(mpg))
# @query
end
df = @chain cte begin
@mutate(mpg_squared = mpg^2,
mpg_rounded = round(mpg),
mpg_efficiency = case_when(
mpg >= cyl^2 , "efficient",
mpg < 15.2 , "inefficient",
"moderate"))
@filter(mpg_efficiency in ("moderate", "efficient"))
@arrange(desc(mpg_rounded))
@collect
end
If the ideal/goal of chaining from a prior cte, is to be able to do multiple queries from it, then we're are halfway there.
In the third chain, you'll notice it contains the contents from the second chain which changed the content of the underlying CTE/SQLQuery struct/metadata. The third chain shows the new columns that were not present in the saved initial test
object which would have the SQLquery and the metadata. (if you do not use @collect
or @show_query
you will see the metadata which is needed for various processes)
One way to enable this (and im open to others if you/others have ideas), would perhaps be to add a @from_cte
macro. This would be used after @chain cte begin
to create a copy of the object (SQLstruct, CTE struct and metadata) to allow for you to build multiple a query from it, without changing the original cte
or even a from_cte
function similar to db_table
which creates the cte copy
test = @chain db_table(con, :mtcars) begin
@filter(!starts_with(model, "M"))
@group_by(cyl)
@summarize(mpg = mean(mpg))
end
df = @chain test begin
@mutate(mpg_squared = mpg^2,
mpg_rounded = round(mpg),
mpg_efficiency = case_when(
mpg >= cyl^2 , "efficient",
mpg < 15.2 , "inefficient",
"moderate"))
@collect
end
@chain test begin
@show_query
end
WITH cte_1 AS (
SELECT *
FROM mtcars
WHERE NOT (starts_with(model, 'M'))),
cte_2 AS (
SELECT cyl, AVG(mpg) AS mpg
FROM cte_1
GROUP BY cyl),
cte_3 AS (
SELECT mpg, POWER(mpg, 2) AS mpg_squared, ROUND(mpg) AS mpg_rounded, CASE WHEN mpg >= POWER(cyl, 2) THEN 'efficient' WHEN mpg < 15.2 THEN 'inefficient' ELSE 'moderate' END AS mpg_efficiency
FROM cte_2 )
SELECT *
FROM cte_3
@korenmiklos as a quick update/option, the function below would allow you to take the cte test
above, and then run new queries on it, without changing the underlying struct/ctes/metadata.
function from_cte(object)
return copiedobject = deepcopy(object)
end
@chain from_cte(test) begin
@mutate(mpg_squared = mpg^2,
mpg_rounded = round(mpg),
mpg_efficiency = case_when(
mpg >= cyl^2 , "efficient",
mpg < 15.2 , "inefficient",
"moderate"))
@collect
end
@chain from_cte(test) begin
@mutate(mpg2 = mpg*2)
@collect
end
# notice the test is unedited.
@chain test @collect
there may be a more memory efficient way to do it without a deepcopy, however.
Thanks for the clarification @drizk1 and @kdpsingh. I see now that CTE
and SQLQuery
are mutable structs.
Reusing SQL expressions could be a great use case for TidierDB. To exploit the performance of DuckDB, I write mostly SQL these days, but it's not easy to link pieces of SQL together. dbt does this, but is too heavy machinery and a different language.
TidierDB could serve as the glue in Julia. I guess this would need SQLQuery
and CTE
to allow for references, rather than strings. Say, from
would be of type Union{CTE,Table}
.
Is this within scope?
@korenmiklos as a quick update/option, the function below would allow you to take the cte
test
above, and then run new queries on it, without changing the underlying struct/ctes/metadata.function from_cte(object) return copiedobject = deepcopy(object) end
@chain from_cte(test) begin @mutate(mpg_squared = mpg^2, mpg_rounded = round(mpg), mpg_efficiency = case_when( mpg >= cyl^2 , "efficient", mpg < 15.2 , "inefficient", "moderate")) @collect end @chain from_cte(test) begin @mutate(mpg2 = mpg*2) @collect end # notice the test is unedited. @chain test @collect
there may be a more memory efficient way to do it without a deepcopy, however.
As long as only the CTE
struct is deepcopied, memory should not be an issue.
To clarify, deepcopy(test)
wouldn't copy the data, only the SQL query struct (which contains the metadata DF), CTE struct, and connection that the macros use to build queries. It wouldn't be of anything else.
Regarding the from
being a union{CTE, table}
, I'm not sure I fully understand. This would so that a portion of the query could be extracted? Does the method above different from the use case you're envisioning? I might not be fully grasping the union{CTE, table}
idea but if there's a feasible way we can make TidierDB enhance your workflow even more then I'd love to do it
If the method above works, I think I know away to do it without using deepcopy
Sorry, wasn't clear. I mean that the structs could store the table to operate on not as string, but as a reference to another struct. When the actual query has to be sent to the DB engine, a recursive function would evaluate these references to the actual query string.
In dbt
, this is done by Jinja templates,
select * from {{ ref('other_table') }} limit 10
where other_table
may be a table, but it also may be a CTE defined earlier.
Maybe none of this is necessary. I don't yet sufficiently understand the architecture.
Why does the second query modify the first? I understand that each command is a CTE and these are chained together. But why is the incoming CTE mutable?
Thanks for the additional dialog. I think dbt and TidierDB have some similarities but also some crucial differences. Will weigh in soon with thoughts once I have some time to write them down.
I will have to do some more reading on dbt, because I am a bit unfamiliar with it.
The reason the second query above modified the test
query is because it's referring to that specific iteration of the struct. Albeit, when I discovered this, I was a bit surprised, as I would've expected it to behave a bit more like chains with dataframes in TidierData, where the DF is not edited in place. But this may have to do with differences in struct behavior I am unfamiliar with
The main reason the incoming cte struct is mutable is that when I was building the package, I added the cte struct later as I realized I needed it to build more complex queries, so I followed the pattern of the original SQL query struct. It is possible that it does not need to be mutable, but I am not 100% sure.
Hi @korenmiklos , so I have put together a more memory friendly version that only uses deepcopy
for the metadata dataframe. Otherwise, it works identically to the example above, allowing the user to reuse a query multiple times without altering the initial query they are building off of.
If this fits your workflow need (and then likely that of others as well), I think it is worth having as part of the TidierDB ecosystem, I will add docs and put in the next release, but first it needs a name.
I think perhaps from_query()
makes the most sense since it builds off prior queries, but I am open to suggestions.
# test defined in example above
DB.@chain DB.from_query(test) begin
DB.@mutate(mpg2 = mpg*2)
DB.@collect
#DB.@show_query
end
DB.@collect test # shows no change in initial query
DB.@chain DB.from_query(test) begin
DB.@mutate(mpg4 = mpg*4)
DB.@collect
# DB.@show_query
end
DB.@collect test # shows no change in initial query
DB.@show_query test
I am going to close this for now now that from_query
has been implemented. Feel free to reopen as/if needed.
I understand the @show_query macro only shows the query, does not return anything. It would be great to return the SQL query to be used in CTE expressions in future queries. Something like:
I'd be happy to work on this if you give me some pointers.