TidierOrg / TidierDB.jl

Tidier database analysis in Julia, modeled after the dbplyr R package.
MIT License

Snowflake #5

Closed mrufsvold closed 2 weeks ago

mrufsvold commented 3 months ago

Hi! I use a lot of Snowflake with my work, and I'd love to contribute a Snowflake Mode to TidierDB.jl. Could someone give me a high level of what needs to be implemented to support a new backend? I'll also read the current code, but it would be awesome to have some context before diving in.

drizk1 commented 3 months ago

This would be incredible! I wanted to add it but have never established a connection.

Queries and CTEs are stored in structs, and a string is built through the macros and the expr_to_sql parsers. A CTE counter is used to order the CTEs.

The functions needed to support a new dialect are as follows.

You'll notice some of them benefit from helper functions (floor_date in the MSSQL dialect, for example, while most of the DuckDB dialect needs none). So you would take the parsing_dialect.jl file of the most similar dialect, make a copy as your base, and make the changes there.

You may need to add minor support in finalize_query, collect, and copy_to, and that's basically it.

Then set_sql_mode swaps the new expr_to_sql parser into all the macros, and it's good to go.

And then test it with your connection to make sure it works.
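As a rough illustration, a new dialect file might start out like this (a sketch only; the function names and signatures here are assumptions based on the description above, not the actual TidierDB internals):

```julia
# parsing_snowflake.jl -- hypothetical sketch, copied from the closest existing dialect.
# The names below (expr_to_sql_snowflake etc.) are assumptions, not real TidierDB APIs.
using MacroTools

function expr_to_sql_snowflake(expr, sq; from_summarize::Bool = false)
    # Walk the Julia expression tree and emit Snowflake-flavored SQL,
    # overriding only the functions whose syntax differs from the base
    # dialect (e.g. date truncation or string functions).
    MacroTools.postwalk(expr) do x
        # ... dialect-specific rewrites go here ...
        x
    end
end
```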

In terms of the dependencies, I'm going to work on making them lighter and optional, so I'll get back to you about that soon.

Of course let me know if you have any questions.

mrufsvold commented 3 months ago

Cool. I'll look at Snowflake's Python connector to figure out how they handle collect and copy_to. The rest I feel pretty good about.

Obligatory OSS "I'm very busy and not sure how long this will take" disclaimer, but I'm on the case!

drizk1 commented 3 months ago

No worries about timing!

As far as copy and collect go: I have only implemented copy_to for 3 backends (DuckDB being the most graceful in supporting a variety of sources), but if that's something that's quite important for your workflow, I'd love to have it.

Regarding collect, most of the backends support DBInterface.execute(con, query). From what I can tell, Snowflake is used through the ODBC.jl package, which I think would allow for a relatively straightforward implementation of collect for Snowflake, but I will have to double-check that.

mrufsvold commented 3 months ago

I'm not super familiar with ODBC. If I'm tracking correctly, the user needs to install the driver themselves and then pass the driver path to ODBC.jl when creating a connection.

That's easy enough to do, but it isn't quite as "batteries included" as the other backends. (If my understanding is correct).

One thought I had was to use ODBC by default, but also create a separate package that lightly wraps the Python snowflake-connector, plus a Pkg extension for TidierDB.jl that switches the Snowflake mode over to snowflake-connector when it is loaded. This would allow users to manage the dependencies automatically within Julia (at the expense of a pretty long install time).

The Python package also has nice features like fetch_arrow, which returns Apache Arrow tables directly, so we wouldn't pay to convert to a Julia DataFrame.

Edit: I've been reading up on the Snowflake REST API. I think that might be the right dependency-free solution, but creating a package to handle it would be non-trivial, so I think ODBC.jl is the way to go in the short term.

drizk1 commented 3 months ago

So outside of DuckDB, the user loads the library they need for a connection, i.e.

```julia
import TidierDB as DB
using ODBC  # or ClickHouse, MySQL, etc.
conn = conn_function("conn_string")  # the connection function of that package
```

and that's all the user loads the package in for. Making a unified connection function is something that might well happen down the road to remove the need for the `using OtherPkg` part.

Regarding ODBC.jl, there is not much out there for Snowflake beyond this (although that could just be someone's one-off experience). The idea of using Python for the connector is definitely interesting, although I'm less familiar with the nuances of leveraging Python in a Julia package, so my inclination is to use ODBC.jl for now as well. Perhaps @kdpsingh can better comment on that.

As far as dependencies go, the plan is to shift all connection dependencies except DuckDB (MySQL, SQLite, ODBC, LibPQ, and ClickHouse) into package extensions to lighten the dependency load.
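For reference, Julia's package-extension mechanism would look roughly like this in Project.toml (names are illustrative, and the UUID is omitted rather than guessed):

```toml
[weakdeps]
ODBC = "..."   # ODBC.jl's UUID goes here

[extensions]
# Loaded automatically only when the user also does `using ODBC`
TidierDBODBCExt = "ODBC"
```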

kdpsingh commented 3 months ago

Thanks @drizk1. I would also say there's no rush to move to package extensions if the current setup is working.

We can worry about that later. It's nice that TidierDB comes batteries included, and we can explore lightening it as a later priority.

mrufsvold commented 3 months ago

I've used PythonCall.jl before. The experience is very smooth imo, but the install process sets up its own version of python, conda, and then all your python dependencies. This takes a long time.

I think rolling it up into TidierDB would be very costly to the UX, so I think it should wait until you're ready to consider extensions. That will let most people opt out, and for those who use Snowflake, it will feel like a separate Snowflake package's fault, not TidierDB's 😅

> Regarding ODBC.jl, there is not much out there for Snowflake beyond this (although that could just be someone's one-off experience)

I found this. No updates in 4 years, and no indication of how fast it is, but still a good place to start trying it out.

kdpsingh commented 3 months ago

Yeah, I would advocate against including either PythonCall or RCall as a dependency. If the ODBC.jl package isn't sufficient, someone probably needs to make or update a Snowflake-specific Julia package.

drizk1 commented 3 weeks ago

Alright so just a quick update here.

I set up a free account with Snowflake and spent a surprising amount of time figuring out OAuth token retrieval and permissions... but finally I made a simple API client that lets me query Snowflake from Julia with just HTTP.jl and JSON3.jl.
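For anyone curious, the core of such a client can be sketched in a few lines (hypothetical: the payload, headers, and response handling below are my assumptions about the Snowflake SQL REST API, not the code that landed in TidierDB):

```julia
using HTTP, JSON3

# Hypothetical sketch of submitting a statement to the Snowflake SQL REST API.
# `account` and `token` must come from your Snowflake setup; the response
# parsing here is deliberately minimal.
function snowflake_query(account::AbstractString, token::AbstractString, sql::AbstractString)
    url = "https://$account.snowflakecomputing.com/api/v2/statements"
    headers = [
        "Authorization" => "Bearer $token",
        "Content-Type"  => "application/json",
        "Accept"        => "application/json",
    ]
    body = JSON3.write((statement = sql,))
    resp = HTTP.post(url, headers, body)
    # Rows are typically under `data`, column info under `resultSetMetaData`.
    return JSON3.read(resp.body)
end
```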

I'll hopefully have a branch up in the next day or two.

My big question is, in terms of workflow: are OAuth tokens regularly used? Or is there a preferred method (I couldn't get a JWT to work)?

drizk1 commented 3 weeks ago

Howdy @mrufsvold

v.1.9 is on its way to the registry, and it includes Snowflake support. I went full batteries-included mode and wrote an interface for the Snowflake SQL REST API; thank you for pointing me in that direction. (The pain of driver paths was becoming overwhelming.)

I put together some docs on it. But since I am not a Snowflake user, I would love for you to try it out and share your thoughts on any friction you experience.

I am not sure of a better way around the OAuth token besides refreshing it in Postman, so I am open to any suggestions/contributions you may have. I think an update-connection function that takes the con and a new token, so you can refresh the connection directly in a chain, might be a good idea, but I'm not sure that would actually be valuable in a workflow.
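That idea might look something like this (entirely hypothetical; the connection type and `token` field are placeholders, not the actual TidierDB struct):

```julia
# Hypothetical helper: swap a fresh OAuth token into an existing connection,
# so it can be refreshed mid-chain without rebuilding the connection.
function update_con!(con, new_token::AbstractString)
    con.token = new_token  # assumes a mutable connection struct with a `token` field
    return con
end
```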

Hope it works smoothly for you. Looking forward to your thoughts!

mrufsvold commented 3 weeks ago

Work's been crazy, so sorry I didn't make any progress, but it's great to hear that you did!

Yes, I can give it a spin next week!

drizk1 commented 3 weeks ago

No worries at all!! Sorry if I scooped it from you. I just had a nice lull before things get crazy Monday, and I wanted to learn something new. No rush at all!

drizk1 commented 2 weeks ago

I'm going to close this issue for now since we have support for it, but of course feel free to reopen if needed, or open new issues for Snowflake or anything else.