JuliaAI / Imbalance.jl

A Julia toolbox with resampling methods to correct for class imbalance.
https://juliaai.github.io/Imbalance.jl/dev/

Documentation clarifications #86

Closed sylvaticus closed 8 months ago

sylvaticus commented 8 months ago

Hello, I think the documentation is well done and very complete. Just a few comments:

  1. What I miss, as a naive user, is some guidance (even just a sentence) on when to use one imbalance model or another. Is there one imbalance model that is clearly superior nowadays? Is producing synthetic data an advantage or a disadvantage compared to copying existing data? What about oversampling vs undersampling?

  2. Are the TableTransforms and MLJ interfaces different from model to model? If not, while it is important to have them documented, I think on the web site they make the documentation a bit redundant and difficult to follow. I would just make a page for each of them (something like "MLJ Interface", "TableTransforms Interface"), give just one full example, and then say they are available for [list of imbalance models], or perhaps list them with their options...

  3. I haven't understood, for the smote function, how the default value for k is determined, e.g. the sentence "It will be automatically set to m-1 for any class with m points where m ≤ k."

  4. It's not clear to me what happens to the dataset when you use multiple balancers.

  5. In the "Combining Resamplers" page, I think you should remove the words "and in general" in the sentence "In prediction, the resamplers balancer1 and balancer2 are bypassed and in general."

I have tested all the algorithms with some continuous data and they seem to work. I have also tested with some missing data in the X, but no algorithm works. I thought that at least the algorithms that simply randomly under/oversample should work with missing data; perhaps a note on this?

Thank you for the great package!

[part of https://github.com/openjournals/joss-reviews/issues/6310 ]

EssamWisam commented 8 months ago

Thank you. Appreciate the feedback.

For (1), I remember deciding with @ablaom that the docs wouldn't be the best place for this, which is why I created a series of four articles, linked in the docs (at the bottom of the main page), that aside from explaining the implemented algorithms give good hints for questions like "Is producing synthetic data an advantage or a disadvantage compared to copying existing data?" and "What about oversampling vs undersampling?". The question "Is there one imbalance model that is clearly superior nowadays?" isn't exactly explored in them, but it isn't easy to answer either: papers come from different times and evaluate on different datasets, and the answer is generally case-dependent. As in all machine learning, the user could always treat the choice of resampler itself as a hyperparameter to find what works best on their problem, and I believe MLJTuning.jl may help with that.

Regarding (2), I absolutely agree that it could be viewed as redundant, as these APIs are similar, but I didn't expect them to also be difficult to follow. In the future I may discuss with @ablaom making a single page for MLJ and another for TableTransforms, where each explains how to infer the API given the pure functional interface and then lists the examples.

Regarding (3), about the sentence "It will be automatically set to m-1 for any class with m points where m ≤ k": in SMOTE, KNN is applied to the data of each class separately. Suppose k is chosen as 10 but one of the classes has only 9 points; then k will be set to 9-1=8 for that class, since any point in it has at most 8 same-class neighbors. I see now that not everyone using SMOTE knows this detail, and I will think about whether there is a better way to phrase it.
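In other words, the default is just a per-class cap on k (an illustrative sketch, not the package's internal code):

effective_k(k::Int, class_size::Int) = min(k, class_size - 1)  # cap k at class_size - 1

effective_k(10, 9)    # 8: a point in a 9-point class has at most 8 same-class neighbors
effective_k(10, 100)  # 10: k is unchanged when the class is large enough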

Regarding (4), a balancer is a mapping from X, y to X', y', where the number of samples is increased or decreased depending on whether it's oversampling or undersampling, respectively. Consider an X with three classes and ratios of 1.0: this would oversample the data from all classes so that each of them has as much data as the majority class, resulting in X', y'. Suppose we further use an undersampler such as ENN with min_ratios=0.8: this would clean the data around the decision boundaries, removing (possibly synthetic) points in X', y' that would likely have made the classification task even harder, while ensuring that no class falls below 80% of the number of points it had in X', y'; the result is the transformed dataset X'', y''. Does that make it clearer? I think having a clear idea of what a single balancer does is enough to conclude what a chain would do.
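For illustration, a rough sketch of such a chain with the functional interface (an illustrative sketch, not a tutorial excerpt; see the docs for the exact signatures):

using Imbalance

# a small synthetic 3-class imbalanced dataset
X, y = generate_imbalanced_data(200, 3; class_probs=[0.6, 0.3, 0.1], rng=42)

# step 1: oversample every class up to the size of the majority class
Xover, yover = smote(X, y; ratios=1.0, rng=42)

# step 2: clean points near the decision boundaries, while keeping each
# class at >= 80% of its post-oversampling size
Xfinal, yfinal = enn_undersample(Xover, yover; min_ratios=0.8, rng=42)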

Regarding (5), indeed, thanks.

I have also tested with some missing data in the X, but no algorithm works. I thought that at least the algorithms that simply randomly under/oversample should work with missing data; perhaps a note on this?

Fair point, but it deserves a separate issue, optionally with a minimal reproducible example.

Thank you again for the great feedback!

CC @ablaom

jbytecode commented 8 months ago

Due to the JOSS submission

https://github.com/openjournals/joss-reviews/issues/6310

EssamWisam commented 8 months ago

@sylvaticus I hope I have appropriately responded to all your questions/concerns. As an update, I improved the documentation for the third point (I stopped specifying the minor detail because it's covered in a warning internally anyway).

sylvaticus commented 8 months ago

Hello, while re-reading the package docs, I discovered an imprecision. On the page https://juliaai.github.io/Imbalance.jl/dev/examples/effect_of_s/effect_of_s/ you cite Decision Trees, then you describe what is essentially a kernel perceptron, and finally in the example you employ Bayesian linear discriminant analysis :-)

EDIT: and then in https://juliaai.github.io/Imbalance.jl/dev/examples/smotenc_churn_dataset/smotenc_churn_dataset/ you say "Let's go for a logistic classifier from MLJLinearModels" but then you go for decision trees.. watch out for these copy/paste errors.

EssamWisam commented 8 months ago

I see the error in the first one and I'm on the way to fix it. It's likely that I copied the previous notebook and modified it and missed this detail. Thanks for pointing that out.

I don't see the error in the second one. I say:

Let's go for a decision tree from BetaML. We can't go for logistic regression as we did in the SMOTE tutorial because it does not support categorical features.

which is technically correct.

sylvaticus commented 8 months ago

Yes, but a few lines earlier you say "Let's go for a logistic classifier from MLJLinearModels".

I think you need just to remove that sentence.

sylvaticus commented 8 months ago

Here is an MRE for the error when the X matrix contains missing / non-numerical values:


using Imbalance

y = ["A", "A", "B", "A", "B"]
X = [1 1.1 2.1;
     1 1.2 2.2;
     2 1.3 2.3;
     1 1.4 2.4;
     2 1.5 2.5; ]
Xover, yover = random_oversample(X, y) # ok

X = [1 1.1 2.1;
     1 1.2 2.2;
     2 1.3 2.3;
     1 1.4 missing;
     2 1.5 2.5; ]
Xover, yover = random_oversample(X, y) # error

X = ["a" 1.1 2.1;
     "a" 1.2 2.2;
     "b" 1.3 2.3;
     "a" 1.4 2.4;
     "b" 1.5 2.5; ]
Xover, yover = random_oversample(X, y) # error

X = ["a" "a";
     "a" "a";
     "b" "b";
     "a" "a";
     "b" "b"; ]
Xover, yover = random_oversample(X, y) # error

As I said, algorithms that simply randomly under/oversample without generating synthetic data basically shouldn't care about the eltype of the matrix at all; at least that's what users may expect. If the implementation converts the matrix into something else, and handling these cases is problematic independently of the algorithm, perhaps this would deserve a note in the documentation...
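To illustrate the expectation, a rough hypothetical sketch of a purely random oversampler that only duplicates rows and so never inspects the feature values (naive_random_oversample is made up for illustration; it is not part of Imbalance.jl):

using Random

# duplicate random rows of each minority class until all classes match the majority size
function naive_random_oversample(X::AbstractMatrix, y::AbstractVector; rng=Random.default_rng())
    counts = Dict(c => count(==(c), y) for c in unique(y))
    target = maximum(values(counts))
    Xover, yover = copy(X), copy(y)
    for (c, n) in counts
        idx = findall(==(c), y)
        extra = idx[rand(rng, 1:n, target - n)]  # indices of rows of class c to duplicate
        Xover = vcat(Xover, X[extra, :])         # rows are copied whole: eltype never matters
        append!(yover, y[extra])
    end
    return Xover, yover
end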

EssamWisam commented 8 months ago

Thank you for the comment.

perhaps this would deserve a note in the documentation...

The documentation so far states that matrix inputs should have elements of type Real. As far as I know, neither Missing nor String is a subtype of that. Likewise, it is also mentioned that tables can accept categorical inputs (so that's what should be used when the data has strings).
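For illustration, strings can be passed as categorical columns of a table instead of a raw String matrix (a rough sketch; the column names f1 and f2 are made up):

using Imbalance, CategoricalArrays

y = ["A", "A", "B", "A", "B"]
X = (f1 = categorical(["a", "a", "b", "a", "b"]),  # categorical column
     f2 = [1.1, 1.2, 1.3, 1.4, 1.5])               # continuous column

Xover, yover = random_oversample(X, y)  # tables may carry categorical columns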

As I said, algorithms that simply randomly under/oversample without generating synthetic data shouldn't basically care anything about the eltype of the matrix, at least that's what users may expect.

Irrespective of what I said above, I agree with this. I will make a PR soon so that when the input is a matrix, random_oversample and random_undersample accept real numbers as well as missing values.

EssamWisam commented 8 months ago

Didn't mean to close it. GitHub did that by itself, I guess.

In the PR above, I fixed the documentation for the two tutorials and made random_oversample and random_undersample accept missing values for matrix inputs (with tests).

sylvaticus commented 8 months ago

Sorry, I didn't see the "A matrix of real numbers" part in the doc. Concerning the second typo in the documentation, you ended up repeating the sentence "Let's go for a decision tree from BetaML". I made a small pull request to correct it; you can then close this issue. Thank you.