TidierOrg / TidierDB.jl

Tidier database analysis in Julia, modeled after the dbplyr R package.
MIT License

Simplification of the open/connection interface #14

Closed camilogarciabotero closed 2 months ago

camilogarciabotero commented 3 months ago

Hi all,

This package has so much potential! I want to open a rather late discussion about how the interface to the different DB backends could be more "Julian" and maybe friendlier. I recognize the idea is to stick as closely as possible to the "tidyverse" style, but I believe this could also lead to a better and more refined "tidy" way.

Currently, connections to the different DB backends work like this:

mem = DB.<backend>_open(":memory:");

Right?

So what if, instead, there were a higher-level method that dispatches according to a backend argument (or keyword argument):

mem = DB.open(":memory:"; backend=<backend>::Backend, kwargs...);

Then we could expand the backend types while consistently keeping a single function name with $n$ backend methods.

Does this make sense? What do you think about this alternative?
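
For illustration, a minimal sketch of how a single entry point could dispatch on a backend type. The Backend hierarchy and the per-backend duckdb_open / sqlite_open helpers named below are placeholders for the sake of the example, not the package's current API:

# illustrative only: a Backend type hierarchy used for dispatch
abstract type Backend end
struct DuckDBBackend <: Backend end
struct SQLiteBackend <: Backend end

# one public function; each backend contributes its own method
open(path; backend::Backend = DuckDBBackend(), kwargs...) = open(backend, path; kwargs...)
open(::DuckDBBackend, path; kwargs...) = duckdb_open(path; kwargs...)   # assumed per-backend helper
open(::SQLiteBackend, path; kwargs...) = sqlite_open(path; kwargs...)   # assumed per-backend helper

# usage: mem = open(":memory:"; backend = SQLiteBackend())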

drizk1 commented 3 months ago

Creating a more unified backend approach is a great idea. Not late at all.

Technically, the only backend that connects with the open_duckdb method is duckdb, and I did that partially for convenience with docstrings to keep moving the package forward.

All the other backends (except SQLite) are as follows:

using Clickhouse/MySQL/ODBC/LibPQ
con = Clickhouse.connect(connection_string) # ie username, password, etc 

So the user currently has to load the specific package they need and set up the connection themselves.

That being said, unifying how the connection is made would be lovely, so one function would be used for all backends and the user wouldn't have to load the libraries anymore (although DuckDB and SQLite are a bit different because they're local, unlike the others).

Is that along the lines of what you were thinking?

camilogarciabotero commented 3 months ago

> Is that along the lines of what you were thinking?

Right, so a couple of cases also came to mind:

  1. The connect method, for instance:

db = DB.duckdb_connect(mem);

Could be something like:

db = DB.connect(mem; backend::Backend=duckdb, kwargs...);

  2. The table method:
@chain DB.db_table(db, :mtcars) begin
    ...
end

Could simply be:

@chain DB.table(db, :mtcars) begin
    ...
end

It already has DB as the prefix from the import anyway.

What do you think?
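
Purely as a sketch, assuming the existing duckdb_connect and db_table functions and an invented DuckDBBackend marker type, the wrapper layer could be as thin as:

# illustrative only: a unified connect plus a table alias
struct DuckDBBackend end   # placeholder standing in for a real backend type

connect(source; backend = DuckDBBackend(), kwargs...) = connect(backend, source; kwargs...)
connect(::DuckDBBackend, source; kwargs...) = duckdb_connect(source; kwargs...)   # delegates to the existing function

# table as a thin wrapper over the existing db_table
table(db, name) = db_table(db, name)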

drizk1 commented 3 months ago

I like the connection example; I think it's definitely something we could do, and then people can choose which method (original vs. Julian) they want to use. It wouldn't be a heavy lift.

For db_table, I'm inclined to leave it as it is to keep it distinct from Tables.jl, since folks might be using TidierDB independently without the DB. prefix. That being said, I'm open to it.

vituri commented 2 months ago

Still on this topic: in R I always use the pool package to manage connections: it auto-creates more connections, validates them, and so on. There is a "factory" that you pass in to generate the connections and the checks to be made.

Currently with TidierDB I wrote a simple macro that creates a temporary connection and kills it after the collect. It is good enough for me, and fast enough for what I do.
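
For anyone who has not used R's pool: roughly, it keeps connections created by a factory function and validates them before handing one out. A very rough Julia sketch of that idea (invented names, no thread safety, not part of TidierDB) might look like this:

# minimal, illustrative connection pool: a factory plus a validity check
struct ConnectionPool{F,V}
    factory::F          # zero-argument function that creates a new connection
    validate::V         # predicate: is this connection still usable?
    idle::Vector{Any}   # connections waiting to be reused
end

ConnectionPool(factory; validate = _ -> true) = ConnectionPool(factory, validate, Any[])

function checkout(pool::ConnectionPool)
    while !isempty(pool.idle)
        con = pop!(pool.idle)
        pool.validate(con) && return con    # reuse a connection that is still alive
    end
    return pool.factory()                   # otherwise open a fresh one
end

checkin(pool::ConnectionPool, con) = push!(pool.idle, con)

# usage sketch:
# pool = ConnectionPool(() -> connect_mariadb("your_database"))
# con = checkout(pool); ...; checkin(pool, con)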

drizk1 commented 2 months ago

I'm not familiar with that package. I will take a look at it.

Would you mind sharing an example of the macro you created? I am not sure killing a connection after the collect is a very common part of workflows, but it sounds interesting.

Here are a few examples of the connect function I currently have in a PR I have yet to merge:

conn = connect(:duckdb)
conn = connect(:mysql; host="localhost", user="root", password="password", db="mydb")

vituri commented 2 months ago

When using MariaDB without a pool of connections, it is often the case that the connection has died and the collect results in an error. So a safe way to use it is to connect, collect, and disconnect. I do this mainly in REST APIs.

Suppose we have the function

function connect_mariadb(db = "your_database")
    # placeholder host and credentials, just for illustration
    con = DBInterface.connect(MySQL.Connection, "localhost", "user", "password"; db = db)
    return con
end

We can define a macro that binds con to a temporary connection and closes it once the expression has run:

macro db_temp(db::AbstractString, expr)
    # escape the whole block so the user's expression can see `con`,
    # then close the connection once the expression has run
    esc(quote
        @info "Connecting..."
        let con = connect_mariadb($db)
            x = $expr

            @info "Disconnecting..."
            DBInterface.close!(con)

            x
        end
    end)
end;

macro db_temp(expr)
    # single-argument version that defaults to the most common database
    esc(:(@db_temp("my_most_common_db", $expr)))
end

So if I want to write to MariaDB, something like this works fine:

@db_temp write_mariadb(my_data_frame, "table_name", con, comando = "REPLACE")

or some TidierDB pipeline:

dados =
    @db_temp @chain begin
        DB.db_table(con, :Resumo)
        DB.@filter(Dia == "2024-04-10")
        DB.@group_by(Operação)
        DB.@summarize(Total = sum(Total))
        DB.@mutate(Média = mean(Total))
        DB.@filter(Média >= 0.1)
        DB.@arrange(desc(Operação))
        @aside DB.@show_query(_)
        DB.@collect
    end

drizk1 commented 2 months ago

I took a look at the pool documentation and dbpool. I think if you wanted to build something analogous to that, along the lines of what you have been using for MariaDB above, it would be fine to bring it into TidierDB, especially if it's something you and others would find valuable in database workflows.

Then users will have access to the connect function as well as this additional method, and they can decide what is best for them.