JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/
Other
1.72k stars 367 forks source link

Simplify computation of grouped minimum while handling missing values #3328

Open xiaobaaaa opened 1 year ago

xiaobaaaa commented 1 year ago

I recently encountered a situation where I needed to compute the minimum value of a variable grouped by another variable while ignoring missing values. In Stata, this can be achieved using the following simple code:

bysort idcode: egen first_union = min(union_year)

In Julia, using the DataFrames.jl package, the equivalent operation requires more verbose code:

using DataFrames

df = DataFrame(idcode = [1, 1, 2, 2], union_year = [missing, missing, 2001, 2002])

function min_skipmissing(x)
    v = collect(skipmissing(x))
    return isempty(v) ? missing : minimum(v)
end

transform!(groupby(df, :idcode), :union_year => min_skipmissing => :first_union)

I find this code to be more complicated and less readable than the equivalent Stata code. It would be great if DataFrames.jl could provide a simpler and more concise way to perform this operation.

Thank you for your consideration.

bkamins commented 1 year ago

In DataFrames.jl for minimum you can just write:

transform!(groupby(df, :idcode), :union_year => minimum => :first_union)

However, indeed for other aggregations it is problematic. I should be resolved in Statistics.jl or StatsBase.jl, but I will keep it open here until we reach a decision what to do about it.

bkamins commented 1 year ago

Just to make it clear, would you also expect minimum(Int[]) to return missing or it should throw an error?

bkamins commented 1 year ago

Ah - now I run your example. You need:

transform!(groupby(df, :idcode), :union_year => (x -> minimum(skipmissing(x), init=missing) => :first_union)

that is indeed a bit verbose (but I think better and more efficient than what you currently use).

xiaobaaaa commented 1 year ago

Thank you very much for your response. The result I obtained when running the code you provided, transform!(groupby(df, :idcode), :union_year => (x -> minimum(skipmissing(x), init=missing)) => :first_union) (The code you provided is missing half of a parenthesis, ")". Is my correction accurate?) was:

4×3 DataFrame
 Row │ idcode  union_year  first_union 
     │ Int64   Int64?      Missing
────┼─────────────────────────────────
   1 │      1     missing      missing
   2 │      1     missing      missing
   3 │      2        2001       missing
   4 │      2        2002       missing

but I was hoping to achieve the following result:

4×3 DataFrame
 Row │ idcode  union_year  first_union 
     │ Int64   Int64?      Missing
─────┼─────────────────────────────────
   1 │      1     missing      missing
   2 │      1     missing      missing
   3 │      2        2001       2001      
   4 │      2        2002       2001      
bkamins commented 1 year ago

Ah - indeed. My code is incorrect (I was writing it from my head and I designed it wrongly).

I am already discussing with JuliaData/JuliaStats maintainers how to best resole your issue.