Closed Wuuzzaa closed 2 years ago
Hello,
Thank you for your message.
Imbalanced classes should not be a problem, as the implementation considers the total number of classes (from the original dataset) for each node as new columns to be added. If you don't mind, could you please share with me your dataset, the parameter configuration of LCEClassifier, and the error, using the package version 0.2.6?
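For illustration, the column-alignment idea described above could look roughly like this (a hypothetical sketch, not the actual LCE code; `RandomForestClassifier` stands in for a generic node estimator):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Sketch: align a node estimator's predict_proba output with the full
# set of classes from the original dataset, so the probability matrix
# always has n_classes columns even when the node's subsample is
# missing some classes.
n_classes = 3  # classes known from the original dataset

# Subsample that happens to contain only classes 0 and 1
rng = np.random.RandomState(0)
X_sub = rng.rand(20, 5)
y_sub = np.array([0, 1] * 10)

est = RandomForestClassifier(random_state=0).fit(X_sub, y_sub)
proba = est.predict_proba(X_sub)  # shape (20, 2): one column per *seen* class

# Pad to the full class count, placing each column at its class index
full_proba = np.zeros((X_sub.shape[0], n_classes))
full_proba[:, est.classes_] = proba

print(full_proba.shape)  # (20, 3); the column for class 2 is all zeros
```

The key is `est.classes_`, which maps each `predict_proba` column back to its original class label.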
Thank you in advance and I remain at your disposal for any additional information.
Best,
Hello,
I was wrong. It works in the current version (I think in older ones too), but only when you use an XGBoost classifier, which is the case in this repo. I experiment with different sklearn estimators as the base estimator for LCE.
So when you change this line: https://github.com/LocalCascadeEnsemble/LCE/blob/9b6f1431ced03c65f4917562672d6283c06d6f4d/lce/_lcetree.py#L266
to create a different estimator, e.g. a random forest, then you will run into an error like this:
File "*\_lcetree.py", line 300, in _create_node
X[:, -1] = pred_proba[:, 1]
IndexError: index 1 is out of bounds for axis 1 with size 1
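The root cause can be reproduced outside of LCE (a minimal sketch; `RandomForestClassifier` here stands in for the swapped-in base estimator): when a node's subsample contains only one class, `predict_proba` returns a single column, so indexing column 1 raises exactly this IndexError.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(10, 3)
y = np.zeros(10)  # subsample containing a single class

clf = RandomForestClassifier(random_state=0).fit(X, y)
proba = clf.predict_proba(X)
print(proba.shape)  # (10, 1): one column per *seen* class only

try:
    proba[:, 1]  # mirrors `X[:, -1] = pred_proba[:, 1]` in _lcetree.py
except IndexError as e:
    print(e)  # index 1 is out of bounds for axis 1 with size 1
```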
The code and settings I used:
from sklearn.datasets import make_classification
from lce import LCEClassifier
import numpy as np
random_state = 0
# make a dataset
# class 0: 50 samples
# class 1: 50 samples
# class 2: 1 sample
X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=random_state)
X = np.vstack([X, np.zeros(shape=X.shape[1])])
y = np.hstack([y, np.array(2)])
# fit lce
clf = LCEClassifier(n_jobs=-1, random_state=random_state)
clf.fit(X, y)
Hello,
Thank you for your message.
Best,
Hi,
I found a bug with heavily imbalanced classes in classification problems.
Bug description
Let's assume we have 3 different classes, imbalanced like this:
Class 0: 5000 samples
Class 1: 1000 samples
Class 2: 10 samples
At training time it is possible that a class is not represented in the subsample for a new node (only the root node does not have this problem).
At prediction time, the model of that node can receive a sample of a class it has never seen in its training data. This leads to an error.
Fix idea
Check whether we have at least one sample of each class in the subsampled training data for a node. If not, add a pseudo sample for each missing class, e.g. with all features set to zero or to the mean value.
I made a pull request: #4
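The fix idea could be sketched like this (a hypothetical helper, not the code from the pull request; the all-zeros pseudo-sample variant is shown):

```python
import numpy as np

def add_missing_classes(X_sub, y_sub, all_classes):
    """Ensure every class from the original dataset appears at least
    once in a node's subsample by injecting one all-zero pseudo sample
    per missing class."""
    missing = np.setdiff1d(all_classes, np.unique(y_sub))
    if missing.size == 0:
        return X_sub, y_sub
    pseudo_X = np.zeros((missing.size, X_sub.shape[1]))
    return np.vstack([X_sub, pseudo_X]), np.hstack([y_sub, missing])

# Example: class 2 is absent from the node's subsample
X_sub = np.random.rand(6, 4)
y_sub = np.array([0, 0, 1, 1, 0, 1])
X_fix, y_fix = add_missing_classes(X_sub, y_sub, all_classes=np.array([0, 1, 2]))
print(sorted(np.unique(y_fix)))  # [0, 1, 2]
```

With this guard in place, every node estimator sees all classes, so `predict_proba` always returns the full number of columns.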