dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0
26.3k stars 8.73k forks

Invalid classes inferred from unique values of `y`. #10078

Open balintbiro opened 8 months ago

balintbiro commented 8 months ago

Hi All,

I am facing a problem with the combination of LabelEncoder and XGBClassifier. Below is a reproducible example that triggers the problem.

import string
import numpy as np
import xgboost
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

X=pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=['col1','col2','col3','col4'])
y=np.random.choice(a=list(string.ascii_uppercase),size=X.shape[0],replace=True)
encoder=LabelEncoder()
y=encoder.fit_transform(y)

X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0)

clf=xgboost.XGBClassifier()
clf.fit(X_train,y_train)

Any ideas why this training is terminated? This issue is somewhat similar to https://github.com/dmlc/xgboost/issues/9747, but here there are no NaN values in y. In my opinion this is related to XGBoost, since other classifiers train on the same data without problems. Thanks in advance!

trivialfis commented 8 months ago

I set np.random.seed(0) and reproduced the error. XGBoost requires encoded labels, meaning the labels must start at 0 and end at n_classes - 1. In your example, np.unique(y_train) gives:

[ 0  1  2  3  4  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]

As one can see, the labels are not contiguous (class 5 is missing from the training split). The solution is to fit the label encoder on the training data instead. A second point: since the labels are discrete classes, you might consider train_test_split(X, y, stratify=y) so the classes are distributed properly across the splits.

david-cortes commented 8 months ago

While this is a very reasonable thing to require from users, accepting arbitrary class labels seems to be a requirement for full scikit-learn compatibility according to their docs and tests: https://scikit-learn.org/stable/developers/develop.html#specific-models

balintbiro commented 8 months ago

I see what you mean. Thank y'all for the answers!

fpt-ian commented 7 months ago

I also agree this is too restrictive. All other sklearn models handle this fine, and the case can occur when running cross_val_score with XGBoost: even with stratified splitting, a fold can still miss one or more classes.