Closed Wuuzzaa closed 2 years ago
Hello,
Thank you for your message.
Imbalanced classes should not be a problem, as the implementation considers the total number of classes (from the original dataset) for each node as new columns to be added. If you don't mind, could you please share with me your dataset, the parameter configuration of LCEClassifier, and the error, using the package version 0.2.6?
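For illustration, the column-alignment idea described above could look roughly like this (a hypothetical sketch, not the actual LCE code; `RandomForestClassifier` stands in for a generic node estimator):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Sketch: align a node estimator's predict_proba output with the full
# set of classes from the original dataset, so the probability matrix
# always has n_classes columns even when the node's subsample is
# missing some classes.
n_classes = 3  # classes known from the original dataset

# Subsample that happens to contain only classes 0 and 1
rng = np.random.RandomState(0)
X_sub = rng.rand(20, 5)
y_sub = np.array([0, 1] * 10)

est = RandomForestClassifier(random_state=0).fit(X_sub, y_sub)
proba = est.predict_proba(X_sub)  # shape (20, 2): one column per *seen* class

# Pad to the full class count, placing each column at its class index
full_proba = np.zeros((X_sub.shape[0], n_classes))
full_proba[:, est.classes_] = proba

print(full_proba.shape)  # (20, 3); the column for class 2 is all zeros
```

The key is `est.classes_`, which maps each `predict_proba` column back to its original class label.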
Thank you in advance and I remain at your disposal for any additional information.
Best,
Hello,
I was wrong. It works in the current version (I think in older ones too), but only when you use an XGBoost classifier, which is the case in this repo. I experiment with different sklearn estimators as the base estimator for LCE.
So when you change this line: https://github.com/LocalCascadeEnsemble/LCE/blob/9b6f1431ced03c65f4917562672d6283c06d6f4d/lce/_lcetree.py#L266
to create a different estimator, e.g. a random forest, then you will run into an error like this:
File "*\_lcetree.py", line 300, in _create_node
X[:, -1] = pred_proba[:, 1]
IndexError: index 1 is out of bounds for axis 1 with size 1
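The root cause can be reproduced outside of LCE (a minimal sketch; `RandomForestClassifier` here stands in for the swapped-in base estimator): when a node's subsample contains only one class, `predict_proba` returns a single column, so indexing column 1 raises exactly this IndexError.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(10, 3)
y = np.zeros(10)  # subsample containing a single class

clf = RandomForestClassifier(random_state=0).fit(X, y)
proba = clf.predict_proba(X)
print(proba.shape)  # (10, 1): one column per *seen* class only

try:
    proba[:, 1]  # mirrors `X[:, -1] = pred_proba[:, 1]` in _lcetree.py
except IndexError as e:
    print(e)  # index 1 is out of bounds for axis 1 with size 1
```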
The code and settings I used:
from sklearn.datasets import make_classification
from lce import LCEClassifier
import numpy as np
random_state = 0
# make a dataset
# class 0: 50 samples
# class 1: 50 samples
# class 2: 1 sample
X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=random_state)
X = np.vstack([X, np.zeros(shape=X.shape[1])])
y = np.hstack([y, np.array(2)])
# fit lce
clf = LCEClassifier(n_jobs=-1, random_state=random_state)
clf.fit(X, y)
Hello,
Thank you for your message.
Best,
Hi,
I found a bug with heavily imbalanced classes in classification problems.
Bug description
Let's assume we have 3 different classes, imbalanced like this:
Class 0: 5000 samples
Class 1: 1000 samples
Class 2: 10 samples
At training time it is possible that a class is not represented in the subsample for a new node (only the root node does not have this problem).
At prediction time, the model of that node can receive a sample of a class it has never seen in its training data. This leads to an error.
Fix idea
Check whether we have at least one sample of each class in the subsampled training data for a node. If not, add a pseudo sample for each missing class, e.g. with all features set to zero or to the mean value.
I made a pull request: #4
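The fix idea could be sketched like this (a hypothetical helper, not the code from the pull request; the all-zeros pseudo-sample variant is shown):

```python
import numpy as np

def add_missing_classes(X_sub, y_sub, all_classes):
    """Ensure every class from the original dataset appears at least
    once in a node's subsample by injecting one all-zero pseudo sample
    per missing class."""
    missing = np.setdiff1d(all_classes, np.unique(y_sub))
    if missing.size == 0:
        return X_sub, y_sub
    pseudo_X = np.zeros((missing.size, X_sub.shape[1]))
    return np.vstack([X_sub, pseudo_X]), np.hstack([y_sub, missing])

# Example: class 2 is absent from the node's subsample
X_sub = np.random.rand(6, 4)
y_sub = np.array([0, 0, 1, 1, 0, 1])
X_fix, y_fix = add_missing_classes(X_sub, y_sub, all_classes=np.array([0, 1, 2]))
print(sorted(np.unique(y_fix)))  # [0, 1, 2]
```

With this guard in place, every node estimator sees all classes, so `predict_proba` always returns the full number of columns.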