JuliaData / CategoricalArrays.jl

Arrays for working with categorical data (both nominal and ordinal)

another take at cut #314

Open bkamins opened 3 years ago

bkamins commented 3 years ago

@nalimilan - is this intended:

julia> x = [fill(1,1000); fill(2, 100); fill(3, 10); 4];

julia> levels(cut(x, 2))
1-element Array{String,1}:
 "Q1: [1.0, 4.0]"

julia> levels(cut(x, 2, allowempty=true))
1-element Array{String,1}:
 "Q1: [1.0, 4.0]"

julia> cut(x, 3)
ERROR: ArgumentError: cannot compute 3 quantiles: `quantile` returned only 0 groups due to duplicated values in `x`.Pass `allowempty=true` to allow empty quantiles or choose a lower value for `ngroups`.

julia> levels(cut(x, 3, allowempty=true))
2-element Array{String,1}:
 "Q1: (1.0, 1.0)"
 "Q2: [1.0, 4.0]"

?

bkamins commented 3 years ago

another test:

julia> x = repeat(1:3, 5);

julia> freqtable(cut(x, 2))
2-element Named Array{Int64,1}
Dim1                                             │ 
─────────────────────────────────────────────────┼───
CategoricalValue{String,UInt32} "Q1: [1.0, 2.0)" │  5
CategoricalValue{String,UInt32} "Q2: [2.0, 3.0]" │ 10

julia> freqtable(cut(x, 3))
3-element Named Array{Int64,1}
Dim1                                                     │ 
─────────────────────────────────────────────────────────┼──
CategoricalValue{String,UInt32} "Q1: [1.0, 1.66667)"     │ 5
CategoricalValue{String,UInt32} "Q2: [1.66667, 2.33333)" │ 5
CategoricalValue{String,UInt32} "Q3: [2.33333, 3.0]"     │ 5

julia> freqtable(cut(x, 4))
2-element Named Array{Int64,1}
Dim1                                             │ 
─────────────────────────────────────────────────┼───
CategoricalValue{String,UInt32} "Q1: [1.0, 2.0)" │  5
CategoricalValue{String,UInt32} "Q2: [2.0, 3.0]" │ 10

julia> freqtable(cut(x, 5))
ERROR: ArgumentError: cannot compute 5 quantiles: `quantile` returned only 2 groups due to duplicated values in `x`.Pass `allowempty=true` to allow empty quantiles or choose a lower value for `ngroups`.

julia> freqtable(cut(x, 6))
4-element Named Array{Int64,1}
Dim1                                                 │ 
─────────────────────────────────────────────────────┼──
CategoricalValue{String,UInt32} "Q1: [1.0, 1.66667)" │ 5
CategoricalValue{String,UInt32} "Q2: [1.66667, 2.0)" │ 0
CategoricalValue{String,UInt32} "Q3: [2.0, 2.33333)" │ 5
CategoricalValue{String,UInt32} "Q4: [2.33333, 3.0]" │ 5

julia> freqtable(cut(x, 7))
ERROR: ArgumentError: cannot compute 7 quantiles: `quantile` returned only 2 groups due to duplicated values in `x`.Pass `allowempty=true` to allow empty quantiles or choose a lower value for `ngroups`.

bkamins commented 3 years ago

For reference, this is what cut_number (from ggplot2 in R) produces in these cases:

> x = c(rep(1, 1000), rep(2, 100), rep(3, 10), 4)
> table(cut_number(x, 2))
Error: Insufficient data values to produce 2 bins.
Run `rlang::last_error()` to see where the error occurred.
> table(cut_number(x, 3))
Error: Insufficient data values to produce 3 bins.
Run `rlang::last_error()` to see where the error occurred.
> table(cut_number(x, 4))
Error: Insufficient data values to produce 4 bins.
Run `rlang::last_error()` to see where the error occurred.

and

> x = rep(1:3, 5)
> table(cut_number(x, 2))

[1,2] (2,3] 
   10     5 
> table(cut_number(x, 3))

   [1,1.67] (1.67,2.33]    (2.33,3] 
          5           5           5 
> table(cut_number(x, 4)) # same with higher values
Error: Insufficient data values to produce 4 bins.
Run `rlang::last_error()` to see where the error occurred.

nalimilan commented 3 years ago

We should probably check in the cut(x, ngroups) method that the created array has a number of levels equal to the requested number of groups. The question of what to do in tricky cases when calling cut(x, breaks) directly is more open.
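The check proposed above can be sketched as follows. This is a minimal illustration using only the Statistics standard library, not the CategoricalArrays.jl implementation; `count_nonempty_groups` and `check_ngroups` are hypothetical names, and the binning convention (left-closed intervals) is an assumption that may differ from what `cut` does internally.

```julia
using Statistics

# Hypothetical helper, not part of CategoricalArrays.jl: count how many
# quantile-based bins actually receive at least one value of `x`.
function count_nonempty_groups(x::AbstractVector{<:Real}, ngroups::Integer)
    # Interior break points; duplicated values in `x` can make them coincide.
    breaks = unique(quantile(x, (1:ngroups-1) ./ ngroups))
    # Bin index of a value: number of breaks that are <= the value, plus one.
    bins = [searchsortedlast(breaks, v) + 1 for v in x]
    return length(unique(bins))
end

# Hypothetical strict check: error unless exactly `ngroups` bins are populated.
function check_ngroups(x::AbstractVector{<:Real}, ngroups::Integer)
    found = count_nonempty_groups(x, ngroups)
    found == ngroups ||
        throw(ArgumentError("requested $ngroups groups but only $found are non-empty"))
    return found
end
```

With `x = repeat(1:3, 5)` this accepts 2 or 3 groups, while the heavily duplicated vector from the first comment is rejected for `ngroups = 2` because every value lands in the same bin.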

bkamins commented 3 years ago

Agreed. But as noted on Slack, there are two use cases for cut(x, ngroups):

  1. The user wants exactly ngroups groups, in which case we should error if that is not possible.
  2. The user wants approximately ngroups groups but needs to be sure the function will not error in production (this is quite common when a pipeline preprocesses 10,000 columns and you do not want one atypical column to cause an error).

Maybe we can make option 1 the default and option 2 opt-in, in which case cut never errors but does the best it can?
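A rough sketch of how the two modes could look, using only the Statistics standard library. The function `quantile_breaks` and its `strict` keyword are hypothetical and exist only to illustrate the proposal; they are not the CategoricalArrays.jl API.

```julia
using Statistics

# Hypothetical function and `strict` keyword, for illustration only.
function quantile_breaks(x::AbstractVector{<:Real}, ngroups::Integer; strict::Bool=true)
    # ngroups + 1 edges from the minimum to the maximum of `x`;
    # edges that coincide because of duplicated values collapse under `unique`.
    edges = unique(quantile(x, (0:ngroups) ./ ngroups))
    found = length(edges) - 1
    if strict && found != ngroups
        # Option 1 (default): the caller asked for exactly `ngroups` groups.
        throw(ArgumentError("cannot form $ngroups groups: only $found distinct intervals"))
    end
    # Option 2 (opt-in): return whatever distinct edges are available.
    return edges
end
```

For `x = repeat(1:3, 5)`, `quantile_breaks(x, 4)` throws, whereas `quantile_breaks(x, 4; strict=false)` returns the three edges 1.0, 2.0 and 3.0, that is, two groups instead of the four requested.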

nalimilan commented 3 years ago

Yes, we could, but we would have to check all possible problems. cut(x, ngroups) is simple, but cut(x, breaks) might rely on throwing errors to avoid returning invalid results, so if we change it we have to be very careful.

bkamins commented 3 years ago

I meant cut(x, ngroups). If someone passes breaks we should be strict I think.