Open bkamins opened 3 years ago
Another test:
julia> x = repeat(1:3, 5);
julia> freqtable(cut(x, 2))
2-element Named Array{Int64,1}
Dim1 │
─────────────────────────────────────────────────┼───
CategoricalValue{String,UInt32} "Q1: [1.0, 2.0)" │ 5
CategoricalValue{String,UInt32} "Q2: [2.0, 3.0]" │ 10
julia> freqtable(cut(x, 3))
3-element Named Array{Int64,1}
Dim1 │
─────────────────────────────────────────────────────────┼──
CategoricalValue{String,UInt32} "Q1: [1.0, 1.66667)" │ 5
CategoricalValue{String,UInt32} "Q2: [1.66667, 2.33333)" │ 5
CategoricalValue{String,UInt32} "Q3: [2.33333, 3.0]" │ 5
julia> freqtable(cut(x, 4))
2-element Named Array{Int64,1}
Dim1 │
─────────────────────────────────────────────────┼───
CategoricalValue{String,UInt32} "Q1: [1.0, 2.0)" │ 5
CategoricalValue{String,UInt32} "Q2: [2.0, 3.0]" │ 10
julia> freqtable(cut(x, 5))
ERROR: ArgumentError: cannot compute 5 quantiles: `quantile` returned only 2 groups due to duplicated values in `x`.Pass `allowempty=true` to allow empty quantiles or choose a lower value for `ngroups`.
julia> freqtable(cut(x, 6))
4-element Named Array{Int64,1}
Dim1 │
─────────────────────────────────────────────────────┼──
CategoricalValue{String,UInt32} "Q1: [1.0, 1.66667)" │ 5
CategoricalValue{String,UInt32} "Q2: [1.66667, 2.0)" │ 0
CategoricalValue{String,UInt32} "Q3: [2.0, 2.33333)" │ 5
CategoricalValue{String,UInt32} "Q4: [2.33333, 3.0]" │ 5
julia> freqtable(cut(x, 7))
ERROR: ArgumentError: cannot compute 7 quantiles: `quantile` returned only 2 groups due to duplicated values in `x`.Pass `allowempty=true` to allow empty quantiles or choose a lower value for `ngroups`.
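The varying group counts above can be reproduced with a minimal sketch that uses only the Statistics standard library. It assumes that quantile-based cutting computes `ngroups + 1` evenly spaced quantiles and then drops duplicated break points; `quantile_breaks` is a hypothetical helper written for illustration, not a function from CategoricalArrays, and the actual implementation may differ in details.

```julia
using Statistics

# Hypothetical helper (not part of CategoricalArrays): compute the
# ngroups + 1 evenly spaced quantiles of x and drop duplicated breaks.
function quantile_breaks(x, ngroups::Integer)
    probs = (0:ngroups) ./ ngroups
    return unique(quantile(x, probs))
end

x = repeat(1:3, 5)

for n in 2:6
    breaks = quantile_breaks(x, n)
    # The number of distinct groups is one less than the number of distinct breaks.
    println(n, " groups requested, ", length(breaks) - 1, " distinct groups: ", breaks)
end
```

With only three distinct values in x, several quantiles coincide, so the number of distinct breaks depends on ngroups in a non-monotonic way. This is consistent with the listings above, where 4 requested groups yield 2 levels and 6 requested groups yield 4.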
As a reference, this is what ggplot2's cut_number produces in these cases:
> x = c(rep(1, 1000), rep(2, 100), rep(3, 10), 4)
> table(cut_number(x, 2))
Error: Insufficient data values to produce 2 bins.
Run `rlang::last_error()` to see where the error occurred.
> table(cut_number(x, 3))
Error: Insufficient data values to produce 3 bins.
Run `rlang::last_error()` to see where the error occurred.
> table(cut_number(x, 4))
Error: Insufficient data values to produce 4 bins.
Run `rlang::last_error()` to see where the error occurred.
and
> x = rep(1:3, 5)
> table(cut_number(x, 2))
[1,2] (2,3]
10 5
> table(cut_number(x, 3))
[1,1.67] (1.67,2.33] (2.33,3]
5 5 5
> table(cut_number(x, 4)) # same with higher values
Error: Insufficient data values to produce 4 bins.
Run `rlang::last_error()` to see where the error occurred.
We should probably check in the cut(x, ngroups) method that the created array has a number of levels equal to the requested number of groups. The question of what to do in tricky cases when calling cut(x, breaks) directly is more open.
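A minimal sketch of such a check, written as a wrapper rather than as a change to cut itself. `cut_checked` is a hypothetical name used only for illustration; the sketch assumes nothing beyond the exported `cut` and `levels` functions of CategoricalArrays.

```julia
using CategoricalArrays

# Hypothetical wrapper (not part of CategoricalArrays): call cut(x, ngroups)
# and verify that the result has exactly the requested number of levels.
function cut_checked(x::AbstractArray, ngroups::Integer)
    result = cut(x, ngroups)
    nlevels = length(levels(result))
    if nlevels != ngroups
        throw(ArgumentError("requested $ngroups groups but cut produced $nlevels levels"))
    end
    return result
end

x = repeat(1:3, 5)
cut_checked(x, 3)   # three distinct quantile groups exist, so this passes
```

Under this sketch, a call such as cut_checked(x, 4) would raise an ArgumentError instead of silently returning fewer levels than requested.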
Agreed. But as noted on Slack there are two use-cases of cut(x, ngroups):
1. ngroups - and then we should error
2. ngroups but be sure that the function will not error in production (this is quite common if you run a pipeline preprocessing your 10,000 columns and you do not want one atypical column to cause an error)

Maybe we can make option 1 the default, and option 2 opt-in, in which case cut never errors but tries to do the best thing it can?
Yes, we could, but we would have to check all possible problems. cut(x, ngroups) is simple, but cut(x, breaks) might rely on throwing errors to avoid returning invalid results, so if we change it we have to be very careful.
I meant cut(x, ngroups). If someone passes breaks we should be strict, I think.
@nalimilan - is this intended:
?