TidierOrg / TidierData.jl

Tidier data transformations in Julia, modeled after the dplyr/tidyr R packages.

@summarize gives unexpected result with StatsBase.mad #103

Closed: tobydriscoll closed this issue 4 months ago

tobydriscoll commented 5 months ago

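I tried to use @summarize with median/mad the same way as with mean/std, but instead of one row per group it returned one row per observation, with mad equal to 0.0 everywhere.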

For example,

using TidierData, Statistics, StatsBase
df = DataFrame((; year=repeat(1982:1984, inner=4), val=rand(12)))
@chain df begin
    @group_by(year)
    @summarize(mean=mean(val), std=std(val))
    @ungroup
end   
 Row │ year   mean      std      
     │ Int64  Float64   Float64  
─────┼───────────────────────────
   1 │  1982  0.747423  0.198018
   2 │  1983  0.383799  0.242957
   3 │  1984  0.610842  0.359883

Brilliant! But:

using TidierData, Statistics, StatsBase
df = DataFrame((; year=repeat(1982:1984, inner=4), val=rand(12)))
@chain df begin
    @group_by(year)
    @summarize(median=median(val), mad=mad(val))
    @ungroup
end  
 Row │ year   median    mad     
     │ Int64  Float64   Float64 
─────┼──────────────────────────
   1 │  1982  0.760522      0.0
   2 │  1982  0.760522      0.0
   3 │  1982  0.760522      0.0
   4 │  1982  0.760522      0.0
   5 │  1983  0.460224      0.0
  ⋮  │   ⋮       ⋮         ⋮
   9 │  1984  0.627629      0.0
  10 │  1984  0.627629      0.0
  11 │  1984  0.627629      0.0
  12 │  1984  0.627629      0.0
                  3 rows omitted

Sadness. There should be no problem using mad, AFAICT.

julia> mad(df.val)
0.32603152946376807

kdpsingh commented 5 months ago

This is easily fixable, and there is a workaround in the meantime. I'll explain why it's happening shortly.

kdpsingh commented 5 months ago

The reason this behavior happens occasionally in TidierData.jl is that the package tries to infer whether a function should be vectorized (i.e., run separately on each element of a vector) or not (i.e., run on the entire vector).

Since most functions and operators do require vectorization, TidierData defaults to vectorizing functions and operators unless it knows not to. It knows which ones not to vectorize by consulting a look-up table. This inference step is called "auto-vectorization" and is part of the magic (for good and bad) of TidierData.

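mean() happens to be part of the look-up table whereas mad() is not. So mad(val) gets vectorized and runs on each value separately, which is why the mad column is 0.0 everywhere and each group keeps all of its rows instead of collapsing to one.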
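
To make the distinction concrete, here is a minimal sketch in plain Julia, outside of TidierData. `simple_mad` is a hypothetical, unnormalized stand-in for StatsBase.mad (it is not the library function), and the values are made up for illustration. Calling it on the whole vector gives one number, while the broadcast form calls it once per element, and the deviation of a single value from itself is zero.

```julia
using Statistics

# Hypothetical stand-in for StatsBase.mad: the unnormalized
# median absolute deviation from the median.
simple_mad(x) = median(abs.(x .- median(x)))

val = [1.0, 2.0, 4.0, 9.0]

# Not vectorized: one call on the whole vector, one number back.
whole = simple_mad(val)

# Vectorized (broadcast): one call per element, one result per element.
elementwise = simple_mad.(val)

println(whole)        # 1.5
println(elementwise)  # [0.0, 0.0, 0.0, 0.0]
```

The broadcast form returns one value per input row, which matches the symptom in the report: a mad column of zeros and groups that do not collapse.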

In a future update, we will add mad() to that list. For now, the workaround is to add a tilde prefix, which marks the function for TidierData as one not to vectorize:

using TidierData, Statistics, StatsBase
df = DataFrame((; year=repeat(1982:1984, inner=4), val=rand(12)))
@chain df begin
    @group_by(year)
    @summarize(median=median(val), mad=~mad(val))
    @ungroup
end  

Alternatively, you can add mad to the do-not-vectorize list for your current session.

The documentation page on auto-vectorization has more details on this behavior and on how to edit the list: https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/autovec/
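
A sketch of the session-level option, assuming the list is exposed as `TidierData.not_vectorized[]`, which is my reading of the linked page; check the page against your installed version before relying on it. The fixed `val` values are made up so the result is reproducible.

```julia
using TidierData, Statistics, StatsBase

# Assumption: the do-not-vectorize list is a Ref holding a vector of Symbols,
# reachable as `TidierData.not_vectorized[]`. Confirm in the linked docs.
push!(TidierData.not_vectorized[], :mad)

df = DataFrame((; year=repeat(1982:1984, inner=4), val=collect(1.0:12.0)))

result = @chain df begin
    @group_by(year)
    @summarize(median=median(val), mad=mad(val))  # no tilde needed once mad is on the list
    @ungroup
end
```

If the assumption holds, `result` should have one row per year, as in the mean/std example. The change lasts only for the current Julia session.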

tobydriscoll commented 5 months ago

That makes sense. Thanks and KUTGW! I'll leave the issue open since you intend to make a change.

kdpsingh commented 5 months ago

Thanks! Yes, I'll close the issue after adding mad() to the do-not-vectorize list.

kdpsingh commented 4 months ago

This is fixed in #107.