balintbiro opened 8 months ago
I set np.random.seed(0) and reproduced the error. XGBoost requires encoded labels, meaning the label values should start from 0 and end at n_classes - 1. In your example, np.unique(y_train) returns:
[ 0 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
As one can see, it's not contiguous: class 5 is missing. The solution is to fit the label encoder on the training data instead. A second point: since the label consists of discrete classes, you might consider train_test_split(X, y, stratify=y) so that the classes are properly distributed across the splits.
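A minimal sketch of the suggested fix (the array values are taken from the np.unique(y_train) output above; variable names are illustrative): fitting the encoder on the training labels themselves maps them onto the contiguous 0..n_classes - 1 range that XGBoost expects.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# The training labels from the issue: class 5 is absent, so the values
# are not the contiguous 0..n_classes-1 range XGBoost expects.
y_train = np.array([0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12,
                    13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24])

# Fit the encoder on the training labels (not on the full label set):
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
print(np.unique(y_train_enc))  # contiguous 0..23

# At prediction time, le.inverse_transform maps the model's outputs
# back to the original class values.
```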
While this is a very reasonable thing to require from users, accepting arbitrary class labels appears to be a requirement for full scikit-learn compatibility according to their docs and tests: https://scikit-learn.org/stable/developers/develop.html#specific-models
I see what you mean. Thank y'all for the answers!
I also agree this is too restrictive. All other sklearn models are fine with this, and the case can occur when doing cross_val_score with XGBoost, even with stratified folds, since a fold can still miss one or more classes.
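The stratified case described above can be shown directly. A minimal sketch (class counts and split numbers are illustrative, not from the original report): when a class has fewer samples than the number of folds, stratification cannot keep it in every training split, so one training fold ends up with non-contiguous labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative labels: classes 0-4 have 10 samples each, class 5 has one.
y = np.concatenate([np.repeat(np.arange(5), 10), [5]])
X = np.zeros((len(y), 1))  # features are irrelevant here

missing = []
for train_idx, _ in StratifiedKFold(n_splits=3).split(X, y):
    # Record any class absent from this training split.
    absent = set(np.unique(y)) - set(np.unique(y[train_idx]))
    missing.append(absent)

# The fold whose test split received the lone class-5 sample has no
# class 5 left in training, so XGBClassifier would reject its labels.
print(missing)
```

sklearn emits a warning here ("The least populated class in y has only 1 members...") but still produces the splits, which is exactly how the failure can slip into a cross_val_score run.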
Hi All,
I am facing a problem with the combination of LabelEncoder and XGBClassifier. Below is the reproducible example that causes the problem.
Any ideas why the training is terminated? This issue is somewhat similar to https://github.com/dmlc/xgboost/issues/9747; however, there are no NaN values in y. In my opinion, this is related to XGBoost itself, since training other classifiers on the same data works without a problem. Thanks in advance!
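The original example did not survive in this copy of the thread, so here is a hedged reconstruction of the failure pattern only (data, sizes, and variable names are illustrative, not the original script): the encoder is fit on the full label array before splitting, and a split that drops one class leaves y_train non-contiguous.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative reconstruction: 25 classes, encoded on the FULL label
# array before any train/test split is made.
y = np.repeat(np.arange(25), 4)          # 100 samples, classes 0..24
y_enc = LabelEncoder().fit_transform(y)  # encoder sees all 25 classes

# Stand-in for an unlucky split that leaves every sample of class 5
# in the held-out part:
y_train = y_enc[y_enc != 5]

# y_train now skips the value 5, so it is not contiguous 0..n_classes-1;
# XGBClassifier.fit raises an error like
# "Invalid classes inferred from unique values of `y`" on such labels.
print(np.unique(y_train))
```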