TidierOrg / TidierDB.jl

Tidier database analysis in Julia, modeled after the dbplyr R package.
MIT License
35 stars 3 forks source link

Return a CTE expression to allow for chaining #13

Closed korenmiklos closed 1 month ago

korenmiklos commented 3 months ago

I understand the @show_query macro only shows the query, does not return anything. It would be great to return the SQL query to be used in CTE expressions in future queries. Something like:

cte = @chain db_table(db, :mtcars) begin
    @filter(!starts_with(model, "M"))
    @group_by(cyl)
    @summarize(mpg = mean(mpg))
    @query
end

df = @chain cte begin
    @mutate(mpg_squared = mpg^2, 
               mpg_rounded = round(mpg), 
               mpg_efficiency = case_when(
                                 mpg >= cyl^2 , "efficient",
                                 mpg < 15.2 , "inefficient",
                                 "moderate"))            
    @filter(mpg_efficiency in ("moderate", "efficient"))
    @arrange(desc(mpg_rounded))
    @collect
end

I'd be happy to work on this if you give me some pointers.

kdpsingh commented 3 months ago

Thanks @korenmiklos. If you omit the @query from your code, this code should just work, where you can continue to build on the query. Is this sufficient?

cte = @chain db_table(db, :mtcars) begin
    @filter(!starts_with(model, "M"))
    @group_by(cyl)
    @summarize(mpg = mean(mpg))
    # @query
end

df = @chain cte begin
    @mutate(mpg_squared = mpg^2, 
               mpg_rounded = round(mpg), 
               mpg_efficiency = case_when(
                                 mpg >= cyl^2 , "efficient",
                                 mpg < 15.2 , "inefficient",
                                 "moderate"))            
    @filter(mpg_efficiency in ("moderate", "efficient"))
    @arrange(desc(mpg_rounded))
    @collect
end
drizk1 commented 3 months ago

If the ideal/goal of chaining from a prior cte, is to be able to do multiple queries from it, then we're are halfway there.

In the third chain, you'll notice it contains the contents from the second chain which changed the content of the underlying CTE/SQLQuery struct/metadata. The third chain shows the new columns that were not present in the saved initial test object which would have the SQLquery and the metadata. (if you do not use @collect or @show_query you will see the metadata which is needed for various processes)

One way to enable this (and im open to others if you/others have ideas), would perhaps be to add a @from_cte macro. This would be used after @chain cte begin to create a copy of the object (SQLstruct, CTE struct and metadata) to allow for you to build multiple a query from it, without changing the original cte

or even a from_cte function similar to db_table which creates the cte copy

test = @chain db_table(con, :mtcars) begin
    @filter(!starts_with(model, "M"))
    @group_by(cyl)
    @summarize(mpg = mean(mpg))
end

df = @chain test begin
    @mutate(mpg_squared = mpg^2, 
               mpg_rounded = round(mpg), 
               mpg_efficiency = case_when(
                                 mpg >= cyl^2 , "efficient",
                                 mpg < 15.2 , "inefficient",
                                 "moderate")) 
    @collect
end

@chain test begin
    @show_query
end
WITH cte_1 AS (
SELECT *
        FROM mtcars
        WHERE NOT (starts_with(model, 'M'))),
cte_2 AS (
SELECT cyl, AVG(mpg) AS mpg
        FROM cte_1
        GROUP BY cyl),
cte_3 AS (
SELECT  mpg, POWER(mpg, 2) AS mpg_squared, ROUND(mpg) AS mpg_rounded, CASE WHEN mpg >= POWER(cyl, 2) THEN 'efficient' WHEN mpg < 15.2 THEN 'inefficient' ELSE 'moderate' END AS mpg_efficiency
        FROM cte_2 )  
SELECT *
        FROM cte_3
drizk1 commented 3 months ago

@korenmiklos as a quick update/option, the function below would allow you to take the cte test above, and then run new queries on it, without changing the underlying struct/ctes/metadata.

function from_cte(object)
    return copiedobject = deepcopy(object)
end
@chain from_cte(test) begin
    @mutate(mpg_squared = mpg^2, 
               mpg_rounded = round(mpg), 
               mpg_efficiency = case_when(
                                 mpg >= cyl^2 , "efficient",
                                 mpg < 15.2 , "inefficient",
                                 "moderate")) 
    @collect
end

@chain from_cte(test) begin
    @mutate(mpg2 = mpg*2)
    @collect
end

# notice the test is unedited.
@chain test  @collect

there may be a more memory efficient way to do it without a deepcopy, however.

korenmiklos commented 3 months ago

Thanks for the clarification @drizk1 and @kdpsingh. I see now that CTE and SQLQuery are mutable structs.

Reusing SQL expressions could be a great use case for TidierDB. To exploit the performance of DuckDB, I write mostly SQL these days, but it's not easy to link pieces of SQL together. dbt does this, but is too heavy machinery and a different language.

TidierDB could serve as the glue in Julia. I guess this would need SQLQuery and CTE to allow for references, rather than strings. Say, from would be of type Union{CTE,Table}.

Is this within scope?

korenmiklos commented 3 months ago

@korenmiklos as a quick update/option, the function below would allow you to take the cte test above, and then run new queries on it, without changing the underlying struct/ctes/metadata.

function from_cte(object)
    return copiedobject = deepcopy(object)
end
@chain from_cte(test) begin
    @mutate(mpg_squared = mpg^2, 
               mpg_rounded = round(mpg), 
               mpg_efficiency = case_when(
                                 mpg >= cyl^2 , "efficient",
                                 mpg < 15.2 , "inefficient",
                                 "moderate")) 
    @collect
end

@chain from_cte(test) begin
    @mutate(mpg2 = mpg*2)
    @collect
end

# notice the test is unedited.
@chain test  @collect

there may be a more memory efficient way to do it without a deepcopy, however.

As long as only the CTE struct is deepcopied, memory should not be an issue.

drizk1 commented 3 months ago

To clarify, deepcopy(test) wouldn't copy the data, only the SQL query struct (which contains the metadata DF), CTE struct, and connection that the macros use to build queries. It wouldn't be of anything else.

Regarding the from being a union{CTE, table} , I'm not sure I fully understand. This would so that a portion of the query could be extracted? Does the method above different from the use case you're envisioning? I might not be fully grasping the union{CTE, table} idea but if there's a feasible way we can make TidierDB enhance your workflow even more then I'd love to do it

If the method above works, I think I know away to do it without using deepcopy

korenmiklos commented 3 months ago

Sorry, wasn't clear. I mean that the structs could store the table to operate on not as string, but as a reference to another struct. When the actual query has to be sent to the DB engine, a recursive function would evaluate these references to the actual query string.

In dbt, this is done by Jinja templates,

select * from {{ ref('other_table') }} limit 10

where other_table may be a table, but it also may be a CTE defined earlier.

Maybe none of this is necessary. I don't yet sufficiently understand the architecture.

Why does the second query modify the first? I understand that each command is a CTE and these are chained together. But why is the incoming CTE mutable?

kdpsingh commented 3 months ago

Thanks for the additional dialog. I think dbt and TidierDB have some similarities but also some crucial differences. Will weigh in soon with thoughts once I have some time to write them down.

drizk1 commented 3 months ago

I will have to do some more reading on dbt, because I am a bit unfamiliar with it.

The reason the second query above modified the test query is because it's referring to that specific iteration of the struct. Albeit, when I discovered this, I was a bit surprised, as I would've expected it to behave a bit more like chains with dataframes in TidierData, where the DF is not edited in place. But this may have to do with differences in struct behavior I am unfamiliar with

The main reason the incoming cte struct is mutable is that when I was building the package, I added the cte struct later as I realized I needed it to build more complex queries, so I followed the pattern of the original SQL query struct. It is possible that it does not need to be mutable, but I am not 100% sure.

drizk1 commented 2 months ago

Hi @korenmiklos , so I have put together a more memory friendly version that only uses deepcopy for the metadata dataframe. Otherwise, it works identically to the example above, allowing the user to reuse a query multiple times without altering the initial query they are building off of.

If this fits your workflow need (and then likely that of others as well), I think it is worth having as part of the TidierDB ecosystem, I will add docs and put in the next release, but first it needs a name.

I think perhaps from_query() makes the most sense since it builds off prior queries, but I am open to suggestions.

# test defined in example above
DB.@chain DB.from_query(test) begin
    DB.@mutate(mpg2 = mpg*2)
    DB.@collect
   #DB.@show_query
end

DB.@collect test # shows no change in initial query 

DB.@chain DB.from_query(test) begin
    DB.@mutate(mpg4 = mpg*4)
    DB.@collect
   # DB.@show_query
end 

DB.@collect test # shows no change in initial query 
DB.@show_query test 
drizk1 commented 1 month ago

I am going to close this for now now that from_query has been implemented. Feel free to reopen as/if needed.