JuliaML / MLUtils.jl

Utilities and abstractions for Machine Learning tasks
MIT License

oversample and undersample always return classes as well #116

Closed by CarloLucibello 2 years ago

CarloLucibello commented 2 years ago

Fix #113 by making the implementation match the docs rather than changing the docs: oversample and undersample now always return the resampled classes as well.

Also made the undersample/oversample calls deterministic when shuffle=false.

Since the change is breaking with respect to the previous behavior (but non-breaking with respect to the behavior declared in the docs), I'm also updating the minor version.
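
To illustrate the intended contract (resampled data and classes returned together, with a reproducible result when nothing is shuffled), here is a minimal sketch in plain Julia. It is not the MLUtils.jl implementation: naive_oversample is a hypothetical helper written only for this example, and it assumes observations are stored as columns of a matrix.

```julia
# Hypothetical helper, NOT the MLUtils.jl implementation: deterministic
# oversampling that cycles through each class's observations, in order,
# until every class has as many observations as the largest class.
function naive_oversample(X::AbstractMatrix, Y::AbstractVector)
    # Group observation indices by class label, in order of first appearance.
    labels = unique(Y)
    groups = [findall(==(l), Y) for l in labels]
    target = maximum(length, groups)

    inds = Int[]
    for idx in groups
        # Repeat the class's own indices cyclically until `target` is reached.
        for k in 1:target
            push!(inds, idx[mod1(k, length(idx))])
        end
    end
    # Observations are columns; return the resampled data AND classes.
    return X[:, inds], Y[inds]
end

X = rand(3, 6)
Y = ["a", "b", "b", "b", "b", "a"]
X_bal, Y_bal = naive_oversample(X, Y)
```

With no randomness involved, calling the helper twice on the same input gives the same output, which is the property one would expect from shuffle=false. Note that this sketch groups the output by class, which may differ from the ordering the library produces.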

SimonEnsemble commented 2 years ago

here b/c I am confused about how oversample works.

# 6 observations with 3 features each
X = rand(3, 6)
# 2 classes, severely imbalanced
Y = ["a", "b", "b", "b", "b", "a"]
# oversample the class "a" to match "b"
X_bal, Y_bal = oversample(X, Y)
# this results in a bigger dataset with repeated data
@assert size(X_bal) == (3,8)
@assert length(Y_bal) == 8
# now both "a", and "b" have 4 observations each
@assert sum(Y_bal .== "a") == 4
@assert sum(Y_bal .== "b") == 4

does not hold as advertised...