JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/
Other
1.73k stars 367 forks source link

Self documenting joins #2472

Open tk3369 opened 3 years ago

tk3369 commented 3 years ago

SQL can be self documenting in joins when you prefix each column. For example:

select a.x, b.y
from tab1 a
inner join tab2 b on b.id = a.id

In DataFrames.jl, it's unclear where the columns came from:

df = innerjoin(tab1, tab2, on = :id)
select!(df, :x, :y)

Was it discussed before how to make the code more self documenting?

A related issue is that I often have to limit the columns from the dataframes anyways before joining:

tab1_tmp = select(tab1, :id, :x)
tab2_tmp = select(tab2, :id, Not(3:10))
innerjoin(tab1_tmp, tab2_tmp, on = :id)

So maybe it's more convenient if I were able to do this in one go:

innerjoin(tab1, tab2, on = :id, left_columns = [:x], right_columns = [:y])

Thoughts?

bkamins commented 3 years ago

The current signature of e.g. innerjoin is:

innerjoin(df1, df2; on, makeunique = false, validate = (false, false), renamecols = identity => identity)

I think renamecols does what you want.

So maybe it's more convenient if I were able to do this in one go:

This can be discussed as a future addition. Let us wait for other users to comment here.

tk3369 commented 3 years ago

I don't want to rename the columns so renamecols is not actually helpful. My main concerns are:

  1. Is the code readable? Since the code does not specify where the column came from, it makes it difficult for someone who's doing code review for example.

  2. In most cases, I use joins to bring in specific columns from another data frame. Say, when I join orders and customers table, I may want to pull the name of the customers only. It's a little inconvenient to do that as two operations rather than a single one. This little inconvenience accumulates and becomes bigger inconvenience when I have to do the same thing many times.

Maybe these can be solved with a convenient frontend package? Not sure...

nalimilan commented 3 years ago

The problem is that if we start adding an argument to allow selecting columns in *join functions, we could also add lots of arguments to other methods. For example, an argument to select a subset of rows to *join and select/transform/combine would make sense, or an argument to group on a variable to select/transform/combine, etc. Having composable functions gives a much simpler API. And in terms of performance it's exactly the same.

To me

innerjoin(select(tab1, :id, :x, copycols=false),
          select(tab2, :id, :y, copycols=false)
          on = :id)

is as clean as:

innerjoin(tab1, tab2, on = :id, left_columns = [:x], right_columns = [:y])

One could argue that the former is easier to understand since select is more familiar to most users than left_columns/right_columns.

EDIT: granted, copycols=false is easy to forget, and it does make a performance difference.

bkamins commented 3 years ago

copycols=false

You can just write df[!, cols] of course which is not verbose.


I think the point is (and this is why I have not rejected the proposal as I initially had the same thought as you) - is as you have noted that you have to repeat :id column in both selects, but maye this is not that bad?

tbeason commented 3 years ago

Would be interesting to have a DataFramesSQL.jl package that just adds this syntax among other SQL clauses. I tend to find it too verbose for typical use-cases, but there are times when I would have made use of this.

pdimens commented 3 years ago

One (possibly outrageous) proposal would be to use the Pair syntax for the first two positional arguments, where the pair would be dataframe => :col (or [:col1, :col2, ...]). I know this syntax doesn't exactly jive with and possibly muddies the general :source => op => :sink format, but in this special case it might be a middle ground. I use this syntax in PopGenSims.jl when supplying specific information from two data sources and I find it reasonably convenient/intuitive.

pdeffebach commented 3 years ago

I'm not 100% sure what the original proposal on this thread is, but I would like a version of renamecols that just renames the variable if it is a duplicate and in the data frame on the right.

In the block below, the first option renames all columns, not just duplicates. The second option just adds a _1, which is not very informative.

Maybe makeunique should allow a string?

julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"]);

julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"]);

julia> name.Job = job.Job;

julia> job.Val = [5.0, 6.4, 7.6];

julia> leftjoin(name, job, on = :ID; renamecols = identity => "_right")
3×5 DataFrame
│ Row │ ID    │ Name      │ Job    │ Job_right │ Val_right │
│     │ Int64 │ String    │ String │ String?   │ Float64?  │
├─────┼───────┼───────────┼────────┼───────────┼───────────┤
│ 1   │ 1     │ John Doe  │ Lawyer │ Lawyer    │ 5.0       │
│ 2   │ 2     │ Jane Doe  │ Doctor │ Doctor    │ 6.4       │
│ 3   │ 3     │ Joe Blogs │ Farmer │ missing   │ missing   │

julia> leftjoin(name, job, on = :ID; makeunique = true)
3×5 DataFrame
│ Row │ ID    │ Name      │ Job    │ Job_1   │ Val      │
│     │ Int64 │ String    │ String │ String? │ Float64? │
├─────┼───────┼───────────┼────────┼─────────┼──────────┤
│ 1   │ 1     │ John Doe  │ Lawyer │ Lawyer  │ 5.0      │
│ 2   │ 2     │ Jane Doe  │ Doctor │ Doctor  │ 6.4      │
│ 3   │ 3     │ Joe Blogs │ Farmer │ missing │ missing  │
nalimilan commented 3 years ago

Yeah, maybe we could more generally allow passing an iterable for makeunique e.g. in hcat too.

bkamins commented 3 years ago

In the block below, the first option renames all columns, not just duplicates.

Actually for me this still feels a more natural thing to do as you know exactly what column names to expect.

However, indeed, an option to rename only non-unique columns can be a nice addition.