JuliaAI / Imbalance.jl

A Julia toolbox with resampling methods to correct for class imbalance.
https://juliaai.github.io/Imbalance.jl/dev/
MIT License

Imbalance doesn't work with categorical data which has `Textual` type #98

Open sylvaticus opened 7 months ago

sylvaticus commented 7 months ago

Take this example:

using DataFrames, Imbalance
a = DataFrame(a = ["a", "b", "c"], b = [1, 2, 3])
b = ["a", "a", "b"]
Xover, yover = random_oversample(a, b)

This fails because the call to ScientificTypes.schema(X).scitypes fails, but the algorithm employed doesn't really care about scientific types.

EssamWisam commented 7 months ago

For consistency, as in the documentation of each method, all methods at most assume that

X: A matrix of real numbers or a table with element scitypes that subtype Union{Finite, Infinite}.

Here the column a has scientific type Textual and should be coerced to Multiclass first. I see that the error does not signal that directly, and I will consider making a PR that throws a better one.

We can also add support for the Textual type but it may not be exactly straightforward given the current implementation.
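
To make the scitype point concrete, here is a minimal sketch of how one might inspect the column scitypes and confirm that coercion brings column a into the documented range. It only uses ScientificTypes and DataFrames, and the variable name scitypes_before is just illustrative:

```julia
using DataFrames, ScientificTypes

a = DataFrame(a = ["a", "b", "c"], b = [1, 2, 3])

# Inspect the scientific type of each column before resampling.
scitypes_before = schema(a).scitypes

# A String column is interpreted as Textual, which is outside Union{Finite, Infinite}.
@assert !(elscitype(a.a) <: Union{Finite, Infinite})

# Coerce the string column in place; it now subtypes Finite.
coerce!(a, :a => Multiclass)
@assert elscitype(a.a) <: Finite
```

After the coercion, the table should satisfy the documented assumption on X.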

ablaom commented 7 months ago

So, for example, this works:

using DataFrames, CategoricalArrays
a = DataFrame(a = categorical(["a", "b", "c"]), b = [1, 2, 3]);

ablaom commented 7 months ago

Or:

using ScientificTypes
coerce!(a, :a => Multiclass)

ablaom commented 6 months ago

But I agree with @sylvaticus that it's worth special-casing this algorithm, which can deal with arbitrary column types.
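
As a rough illustration of why random oversampling need not look at column types, here is a sketch that only resamples row indices. The function naive_random_oversample is hypothetical and is not part of Imbalance.jl; it assumes a DataFrame input, a 1-indexed label vector, and that every class should be raised to the majority count:

```julia
using DataFrames, Random

# Hypothetical helper, not the Imbalance.jl implementation: oversample each
# class up to the majority count by drawing row indices with replacement.
function naive_random_oversample(X::DataFrame, y::AbstractVector;
                                 rng = Random.default_rng())
    classes = unique(y)
    n_max = maximum(count(==(c), y) for c in classes)
    extra = Int[]
    for c in classes
        idx = findall(==(c), y)               # rows belonging to class c
        n_needed = n_max - length(idx)        # how many rows to add
        append!(extra, rand(rng, idx, n_needed))
    end
    rows = vcat(collect(eachindex(y)), extra) # original rows, then the copies
    return X[rows, :], y[rows]
end

a = DataFrame(a = ["a", "b", "c"], b = [1, 2, 3])
b = ["a", "a", "b"]
Xover, yover = naive_random_oversample(a, b)
```

Since rows are only copied, never interpolated, the String column passes through unchanged without any coercion.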