TidierOrg / TidierDB.jl

Tidier database analysis in Julia, modeled after the dbplyr R package.
MIT License
43 stars 3 forks source link

Support asof join types #6

Open era127 opened 5 months ago

era127 commented 5 months ago

It would be great if the joins supported the ASOF (specifically the left asof) joins as it is supported in duckdb. I know this is not standard sql but it’s quite useful for time series analysis.

drizk1 commented 5 months ago

Thank you for bringing this up!

We can definitely get this implemented based on the docs you shared.

right now, my thought is to make it's its own macro and support the additional arguments, but I'm open to making it part of left_join as well and use an as_of argument to specify the syntax.

kdpsingh commented 5 months ago

If we do this, we should do it within the joins (not as separate macros). Happy to talk offline about what the syntax should be on our end.

Tidyverse actually supports asof joins so it would be nice to keep this consistent with that syntax, which is incidentally very similar to DuckDB syntax.

kdpsingh commented 5 months ago

Here's the DuckDB syntax: https://duckdb.org/docs/guides/sql_features/asof_join.html

Here's the tidyverse syntax (see inequality joins and closest()): https://dplyr.tidyverse.org/reference/join_by.html

We don't have to use join_by() in TidierDB but we should think about how this would work.

drizk1 commented 5 months ago

Oh wow. I did not know that. I'll take a look at these docs as well and then reach out soon about what might be the best way to add this with tidyverse as our guide, @kdpsingh

era127 commented 5 months ago

Here is an older reference from before duckdb supported that join natively and the query was quite complex.

https://bwlewis.github.io/duckdb_and_r/asof/asof.html

drizk1 commented 5 months ago

This is just one potential syntax option. I am happy to explore other styles as well that might be more in line with Tidyverse style joins (changes to underlying code were minimal).

import TidierDB as DB
db = DB.duckdb_open(":memory")
con = DB.duckdb_connect(db)
DB.copy_to(con, "https://duckdb.org/data/prices.csv", "prices")
DB.copy_to(con, "https://duckdb.org/data/holdings.csv", "holdings")

@chain DB.db_table(con, :holdings) begin
    DB.@left_join(:prices, ticker, ticker, as_of = true, and_clause = "holdings.when >= prices.when")
    DB.@select(holdings.ticker, holdings.when) 
    DB.@mutate(value = price * shares)
    DB.@show_query
   # @collect
end
WITH cte_2 AS (
SELECT  holdings.ticker, holdings.when, price * shares AS value
        FROM holdings ASOF
        LEFT JOIN prices ON prices.ticker = holdings.ticker AND holdings.when >= prices.when)  
SELECT *
        FROM cte_2

i tried using the parenthesis method on the DuckDB website but it would not run locally.

kdpsingh commented 5 months ago

Nice work on this. I don't like this syntax. I'll share an alternative goal syntax.

I also want to remove the need for symbols from the joins. I know it's doable I think I know why you were running into issues.

drizk1 commented 5 months ago

sounds good, i revisited the duckdb code that I could not get to run yesterday and I wanted to clarify my fixed issue. the reason it did not run was because the column name is when. When i put it made it 'when' , the below code ran. However, the using keyword does not support table.name conventions (my initial approach). Looking forward to your syntax goal.

DuckDB.execute(con, "
SELECT ticker, h.when, price * shares AS value
FROM holdings h
ASOF JOIN prices p USING (ticker, 'when');") |> DataFrame
drizk1 commented 5 months ago

I was playing around with other possible approaches. here is another, albeit verbose, possible syntax. i also removed the need for symbols in joins. (the quotes are needed to avoid the parsing error for when in duckdb). This could surely be shifted to something even closer to join_by and inequality joins in tidyverse.

DB.@chain DB.db_table(con, :holdings) begin
    DB.@left_join(prices, as_of = true, using_columns = [ticker, "\"when\""])
    DB.@select(holdings.ticker, holdings.when) 
    DB.@mutate(value = price * shares)
    DB.@show_query
   # DB.@collect
end
WITH cte_2 AS (
SELECT  holdings.ticker, holdings.when, price * shares AS value
        FROM holdings ASOF
        LEFT JOIN prices USING (ticker, 'when'))  
SELECT *
        FROM cte_2
kdpsingh commented 5 months ago

Appreciate you looking at this, @drizk1. Sorry have been slow on my end -- will share ideal syntax shortly.

drizk1 commented 5 months ago

No rush at all. I just like to tinker in my free time