Open tk3369 opened 3 years ago
The current signature of e.g. innerjoin
is:
innerjoin(df1, df2; on, makeunique = false, validate = (false, false), renamecols = identity => identity)
I think renamecols
does what you want.
So maybe it's more convenient if I were able to do this in one go:
This can be discussed as a future addition. Let us wait for other users to comment here.
I don't want to rename the columns so renamecols
is not actually helpful. My main concerns are:
Is the code readable? Since the code does not specify where the column came from, it makes it difficult for someone who's doing code review for example.
In most cases, I use joins to bring in specific columns from another data frame. Say, when I join orders
and customers
table, I may want to pull the name of the customers only. It's a little inconvenient to do that as two operations rather than a single one. This little inconvenience accumulates and becomes bigger inconvenience when I have to do the same thing many times.
Maybe these can be solved with a convenient frontend package? Not sure...
The problem is that if we start adding an argument to allow selecting columns in *join
functions, we could also add lots of arguments to other methods. For example, an argument to select a subset of rows to *join
and select
/transform
/combine
would make sense, or an argument to group on a variable to select
/transform
/combine
, etc. Having composable functions gives a much simpler API. And in terms of performance it's exactly the same.
To me
innerjoin(select(tab1, :id, :x, copycols=false),
select(tab2, :id, :y, copycols=false)
on = :id)
is as clean as:
innerjoin(tab1, tab2, on = :id, left_columns = [:x], right_columns = [:y])
One could argue that the former is easier to understand since select
is more familiar to most users than left_columns
/right_columns
.
EDIT: granted, copycols=false
is easy to forget, and it does make a performance difference.
copycols=false
You can just write df[!, cols]
of course which is not verbose.
I think the point is (and this is why I have not rejected the proposal as I initially had the same thought as you) - is as you have noted that you have to repeat :id
column in both select
s, but maye this is not that bad?
Would be interesting to have a DataFramesSQL.jl package that just adds this syntax among other SQL clauses. I tend to find it too verbose for typical use-cases, but there are times when I would have made use of this.
One (possibly outrageous) proposal would be to use the Pair
syntax for the first two positional arguments, where the pair would be dataframe => :col
(or [:col1, :col2, ...]
).
I know this syntax doesn't exactly jive with and possibly muddies the general :source => op => :sink format, but in this special case it might be a middle ground.
I use this syntax in PopGenSims.jl when supplying specific information from two data sources and I find it reasonably convenient/intuitive.
I'm not 100% sure what the original proposal on this thread is, but I would like a version of renamecols
that just renames the variable if it is a duplicate and in the data frame on the right.
In the block below, the first option renames all columns, not just duplicates. The second option just adds a _1
, which is not very informative.
Maybe makeunique
should allow a string
?
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"]);
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"]);
julia> name.Job = job.Job;
julia> job.Val = [5.0, 6.4, 7.6];
julia> leftjoin(name, job, on = :ID; renamecols = identity => "_right")
3×5 DataFrame
│ Row │ ID │ Name │ Job │ Job_right │ Val_right │
│ │ Int64 │ String │ String │ String? │ Float64? │
├─────┼───────┼───────────┼────────┼───────────┼───────────┤
│ 1 │ 1 │ John Doe │ Lawyer │ Lawyer │ 5.0 │
│ 2 │ 2 │ Jane Doe │ Doctor │ Doctor │ 6.4 │
│ 3 │ 3 │ Joe Blogs │ Farmer │ missing │ missing │
julia> leftjoin(name, job, on = :ID; makeunique = true)
3×5 DataFrame
│ Row │ ID │ Name │ Job │ Job_1 │ Val │
│ │ Int64 │ String │ String │ String? │ Float64? │
├─────┼───────┼───────────┼────────┼─────────┼──────────┤
│ 1 │ 1 │ John Doe │ Lawyer │ Lawyer │ 5.0 │
│ 2 │ 2 │ Jane Doe │ Doctor │ Doctor │ 6.4 │
│ 3 │ 3 │ Joe Blogs │ Farmer │ missing │ missing │
Yeah, maybe we could more generally allow passing an iterable for makeunique
e.g. in hcat
too.
In the block below, the first option renames all columns, not just duplicates.
Actually for me this still feels a more natural thing to do as you know exactly what column names to expect.
However, indeed, an option to rename only non-unique columns can be a nice addition.
SQL can be self documenting in joins when you prefix each column. For example:
In DataFrames.jl, it's unclear where the columns came from:
Was it discussed before how to make the code more self documenting?
A related issue is that I often have to limit the columns from the dataframes anyways before joining:
So maybe it's more convenient if I were able to do this in one go:
Thoughts?