TidierOrg / TidierDB.jl

Tidier database analysis in Julia, modeled after the dbplyr R package.
MIT License
35 stars 3 forks source link

Remove join symbols, adds full/semi/anti join, connect, adds `from_query`, adds athena joining support #15

Closed drizk1 closed 2 months ago

drizk1 commented 2 months ago

This PR

Before merging i wanted to run sql_agg by you, @kdpsingh , especially the name. By a stroke of luck when building Tidierdb, @summarize can handle any sql aggregate function as long as the sql syntax is correctly written regardless of which aggregate functions are built into the parsers, which makes it pretty flexible. However, due to SQL behavior with window functions, @mutate can do this for generic sql functions but not for aggregate functions .

sql_agg is is similar to dbplyrs sql(), which allows user to directly put a sql string into a query.. However, the main difference here is that sql_agg is only used in @mutate to give users flexibility to use any aggregate function (outside of those in the parsers) from any dialect in their query, while the available sql parsers expand. (duckdb and clickhouse alone have many agg functions for example)

So the two questions for sql_agg are: is there a preferable name to sql_agg, and should something like sql_agg be included

brief example

DB.@chain DB.db_table(con, :mtcars) begin
    DB.@group_by cyl
    DB.@mutate test = sql_agg("regr_intercept(mpg, hp)") 
    DB.@select test
    DB.@show_query
   # DB.@collect()
end
WITH cte_2 AS (
SELECT  model, mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, regr_intercept(mpg, hp) OVER (PARTITION BY cyl ) AS test
        FROM mtcars)  
SELECT test
        FROM cte_2

connection examples (connect auto selects the parser as well)

conn = connect(:duckdb)
conn = connect(:mysql; host="localhost", user="root", password="password", db="mydb")
kdpsingh commented 2 months ago

Ooh I will take a look shortly.

kdpsingh commented 2 months ago

Will share feedback by tomorrow. Thanks for continued work on it.

drizk1 commented 2 months ago

all good! no rush! I'm just enjoying my free time

edit, was just resolving the conflicts

kdpsingh commented 2 months ago

Given how swamped I've been, I would suggest moving forward with all this.

My main question: is sql_agg() doing something different than sql() in dbplyr? If not, should we just call it sql()?

drizk1 commented 2 months ago

sql_agg and sql are different.

sql in dbplyr is used to directly inject a complete string in any function vs sql_agg only works with @mutate to allow any aggregate function to be used in mutate, not just those in the parsers. So it still interplays with the underlying parsers to correctly add windowing allowing it to work.

I havent added sql yet because I'm how to make it play nicely with the metadata (something that is not an issue here)

so the difference would be in dbplyr it would be

mtcars |>
   group_by(cyl) |> 
   mutate(sql("regr_intercept(mpg, hp) OVER (PARTITION BY cyl ) AS test")

vs here

DB.@chain DB.db_table(con, :mtcars) begin
    DB.@group_by cyl
    DB.@mutate test = sql_agg("regr_intercept(mpg, hp)")
kdpsingh commented 2 months ago

Got it. In that case, suggest moving forward!

kdpsingh commented 2 months ago

Does sql_agg insert raw strings in a way that can be used in other functions like summarize or filter? Or only mutate?

drizk1 commented 2 months ago

It throws an error if it's used with @summarize.

someone could use it elsewhere to inject a string elsewhere.

I could change this behavior and make it error if used elsewhere as well

drizk1 commented 2 months ago

So i think for now that I will just leave sql_agg stashed away in the parsers while i refine it and make it better, and will put it in the readme/news at a later. so I guess it would be an easter egg for anyone who has seen the commits/explore the code

kdpsingh commented 2 months ago

Another potential name for it would be sql_raw(). I think it's fine to leave it as an internal function as long as it's not intended for users to use it directly. But if you do export it, document it and just state its limitations.

drizk1 commented 2 months ago

sql_raw could work as well as a name. Its not as explicit, but I would be cool with that too.

For now I have commented it out of parsers, and will bring it back online when I add stricter erroring for it to force it to only be used in @mutate

On an unrelated note, I have added Athena joining support (tested successfully with self joins for now) and will merge and release shortly. It adds a slight syntax difference only for Athena Joins where there is an additional argument with the parameters (i am brainstorming ways to fix this but I wanted people who might need it to be able to use it now so it is covered in the docs)

below pulls from the docs covering the joins in athena

 @chain from_query(mtcars) begin
    @full_join(demodb.mtcars, cyl, cyl, athena_params)
    @group_by(efficiency)
    @summarize(avg_hp = mean(hp))
    @collect
 end