JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

Document custom generation of column names in manual #3430

Open schlichtanders opened 3 months ago

schlichtanders commented 3 months ago

I am looking for a fix or workaround for using AsTable together with several columns that should each be transformed, i.e. when broadcasting with .=>.

I always get ERROR: ArgumentError: Duplicate column name(s) returned:

using DataFrames

df = DataFrame(a = 1:10, b = 4:13)
function myextrema(a)
    ex = extrema(a)
    (min=ex[1], max=ex[2])
end

combine(df, :a => myextrema => AsTable)  # works
combine(df, [:a, :b] .=> myextrema .=> AsTable)  # fails: both results are named :min and :max

The second call throws the following error:

ERROR: ArgumentError: Duplicate column name(s) returned: :min, :max
Stacktrace:
[1] select_transform!(::Base.RefValue{…}, df::DataFrame, newdf::DataFrame, transformed_cols::Set{…}, copycols::Bool, allow_resizing_newdf::Base.RefValue{…}, column_to_copy::BitVector)
@ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:838
[2] _manipulate(df::DataFrame, normalized_cs::Vector{Any}, copycols::Bool, keeprows::Bool)
@ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1778
[3] manipulate(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, keeprows::Bool, renamecols::Bool)
@ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1698
[4] #manipulate#599
@ ~/.julia/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1833 [inlined]
[5] combine(df::DataFrame, args::Any; renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1669
[6] top-level scope
@ REPL[125]:1

My ideal behaviour would be that AsTable prepends the column name, but of course this would be breaking. Maybe there could be a PrependColName(AsTable) wrapper or something similar?

bkamins commented 3 months ago

This is the intended way to do it:

julia> combine(df, [:a, :b] .=> myextrema .=> x -> x .* ["_min", "_max"])
1×4 DataFrame
 Row │ a_min  a_max  b_min  b_max
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1     10      4     13

You can even shorten it to, for example:

julia> combine(df, [:a, :b] .=> Ref∘extrema .=> x -> x .* ["_min", "_max"])
1×4 DataFrame
 Row │ a_min  a_max  b_min  b_max
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1     10      4     13

schlichtanders commented 3 months ago

Thank you very much - I couldn't find such an example in the documentation.

I still don't understand why your second version works :sweat_smile:.

This approach has the disadvantage that one has to repeat the field names that the transformation function returns. It looks flexible and easy to understand, which is really great, but it also feels like duplication.

bkamins commented 3 months ago

  1. It is documented that to produce multiple columns you have to either pass AsTable or a vector of column names.
  2. It is documented that you can auto-generate the target column names using a function (to dynamically generate them). In this case the function takes source column names as input.
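
The two documented options listed above can be sketched as follows. This is only a minimal illustration that reuses the df from this thread; the helper name `lo_hi` and the target names `a_lo`/`a_hi` are made up for the example and do not come from the manual.

```julia
using DataFrames

df = DataFrame(a = 1:10, b = 4:13)

# Hypothetical helper for this sketch: returns two named values per column.
lo_hi(v) = (min = minimum(v), max = maximum(v))

# Option 1: an explicit vector of target column names.
by_names = combine(df, :a => lo_hi => ["a_lo", "a_hi"])

# Option 2: a function that receives the source column name (a string)
# and returns the target column names.
by_function = combine(df, :a => lo_hi => (c -> c .* ["_min", "_max"]))
```

In both cases the target names replace the field names of the returned NamedTuple, which is why broadcasting over several source columns no longer produces duplicates.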

> This approach has the disadvantage that one needs to replicate which fields the transformation function has.

Yes, this is a disadvantage. That is why I pointed out that the transformation function does not have to define these column names itself (the example with Ref, where the function returns a Ref-wrapped tuple without field names and the target names come only from the naming function).

We could allow a function that takes both the source column names and the names returned by the transformation and combines them, but that seemed overly complex (i.e. the API would be hard for typical users to understand and learn). What I gave you was the most concise variant.

The variant that you want is available, and it avoids duplication, but the disadvantage is that the code is longer (so I thought that it is less interesting):

julia> using DataFrames

julia> df = DataFrame(a = 1:10, b = 4:13)
10×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6
   4 │     4      7
   5 │     5      8
   6 │     6      9
   7 │     7     10
   8 │     8     11
   9 │     9     12
  10 │    10     13

julia> function myextrema(a)
           ex = extrema(a[1])
           n = propertynames(a)[1]
           (; Symbol(n, "_min") => ex[1], Symbol(n, "_max") => ex[2])
       end
myextrema (generic function with 1 method)

julia> combine(df, AsTable.([:a, :b]) .=> myextrema .=> AsTable) 
1×4 DataFrame
 Row │ a_min  a_max  b_min  b_max
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1     10      4     13
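
The same idea can be factored into a reusable wrapper, so that the transformation itself stays unaware of column names. The sketch below is only an illustration: `prefix_fields` is a hypothetical helper written for this thread, not part of DataFrames.jl, and it assumes each source is a single-column AsTable selection.

```julia
using DataFrames

df = DataFrame(a = 1:10, b = 4:13)

# Plain transformation that knows nothing about source column names.
function myextrema(v)
    ex = extrema(v)
    return (min = ex[1], max = ex[2])
end

# Hypothetical helper (not part of DataFrames.jl): wraps `f` so that the
# fields of the NamedTuple it returns are prefixed with the source column name.
# It expects a one-field NamedTuple, as produced by a single-column AsTable source.
function prefix_fields(f)
    return function (nt)
        col = propertynames(nt)[1]   # source column name as a Symbol
        out = f(nt[1])               # apply f to the column vector
        newnames = Tuple(Symbol(col, "_", k) for k in keys(out))
        return NamedTuple{newnames}(values(out))
    end
end

res = combine(df, AsTable.([:a, :b]) .=> prefix_fields(myextrema) .=> AsTable)
```

With this wrapper the field names are written once, inside `myextrema`, and the prefix is derived from the source column.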

schlichtanders commented 3 months ago

> 2. It is documented that you can auto-generate the target column names using a function (to dynamically generate them). In this case the function takes source column names as input.

Could an example be added to https://dataframes.juliadata.org/stable/man/working_with_dataframes/? This was my source of truth and there I couldn't find it.

bkamins commented 3 months ago

There is an example in the docstring (https://dataframes.juliadata.org/stable/lib/functions/#DataFrames.combine). We could also add something to the intro manual. Could you propose something that you would find most useful?

schlichtanders commented 3 months ago

I think it would fit nicely just below the .=> example in the combine section:

julia> combine(df, names(df) .=> sum, names(df) .=> prod)
1×4 DataFrame
 Row │ A_sum  B_sum    A_prod  B_prod
     │ Int64  Float64  Int64   Float64
─────┼─────────────────────────────────
   1 │    10     10.0      24     24.0

# this is new:
julia> combine(df, names(df) .=> Ref ∘ extrema .=> (c -> c .* ["_min", "_max"]))

Probably with a little extra explanation of what Ref is doing here (I haven't entirely understood why it is needed yet).
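
To make the role of Ref a bit more concrete, here is a small sketch using the a/b data frame from earlier in this thread. `Ref ∘ extrema` is ordinary function composition, so each column is mapped to a Ref that wraps the `(min, max)` tuple. My reading of the combine docstring is that a returned Ref is unwrapped and its content treated as a single value; please treat that explanation as an assumption rather than an authoritative statement.

```julia
using DataFrames

df = DataFrame(a = 1:10, b = 4:13)

# `Ref ∘ extrema` behaves like v -> Ref(extrema(v)).
wrapped = (Ref ∘ extrema)(df.a)
inner = wrapped[]   # dereferencing gives back the plain tuple from extrema

# Same call as suggested above, with the composition parenthesized for clarity.
res = combine(df, [:a, :b] .=> (Ref ∘ extrema) .=> (c -> c .* ["_min", "_max"]))
```

Here the tuple carries no field names at all, so the only place where "_min" and "_max" appear is the naming function.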

bkamins commented 3 months ago

https://bkamins.github.io/julialang/2024/03/22/minicontainers.html

bkamins commented 3 months ago

See #3433 for an update of the manual. Of course please comment if something is not clear or should be improved.

schlichtanders commented 3 months ago

Looks especially good. Thank you for the detailed documentation improvement!