dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Bug: classes_ #8433

Closed: jaideep11061982 closed this issue 2 years ago

jaideep11061982 commented 2 years ago

We get an error message on `xgb.fit`:

```
Invalid classes inferred from unique values of y. Expected: [0 1 2 3 4 5], got [1 2 3 4 5 6]
```

The check comes from:

```python
self.classes_ = np.unique(np.asarray(y))
self.n_classes_ = len(self.classes_)
expected_classes = np.arange(self.n_classes_)
```

I think we don't need this check. It is not always the case that the training data contains every class, e.g. because of data availability issues for a class, or because some classes only appear in the validation/test set.
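For reference, here is a minimal sketch that reproduces the error; the data is made up for illustration and assumes a recent xgboost release:

```python
import numpy as np
from xgboost import XGBClassifier

# Hypothetical data: 6 classes labeled 1..6 instead of the expected 0..5
X = np.random.rand(60, 4)
y = np.tile(np.arange(1, 7), 10)

clf = XGBClassifier()
clf.fit(X, y)  # raises: Invalid classes inferred from unique values of y
```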

hcho3 commented 2 years ago

XGBoost requires that the training set contain examples from every class label. If you are using K-fold cross-validation, you should use stratified sampling. See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
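For example, a minimal sketch with `StratifiedKFold`; the data here is hypothetical, and stratification keeps every class present in every training fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

# Hypothetical balanced data: 6 classes labeled 0..5
X = np.random.rand(120, 4)
y = np.tile(np.arange(6), 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in skf.split(X, y):
    # Each training fold contains examples of all 6 classes
    clf = XGBClassifier()
    clf.fit(X[train_idx], y[train_idx],
            eval_set=[(X[valid_idx], y[valid_idx])])
```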

jaideep11061982 commented 2 years ago

> XGBoost requires that the training set contain examples from every class label. If you are using K-fold cross-validation, you should use stratified sampling. See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html

@hcho3 this is a hard requirement unless it brings some performance gain. As I mentioned, it is not always the case that the train and validation sets will have at least one instance of each class.

hcho3 commented 2 years ago

This is a limitation of the current algorithm and we have no intention to change this. The reason is that we fit separate trees for every class, and it's not possible to fit a tree on an empty set.

As I mentioned in my earlier comment, there are ways to create train and validation sets so that every class is represented in every set.
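For a single train/validation split, scikit-learn's `train_test_split` with `stratify=y` does the same job; a minimal sketch with hypothetical data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hypothetical balanced data: 6 classes labeled 0..5
X = np.random.rand(120, 4)
y = np.tile(np.arange(6), 20)

# stratify=y preserves the class proportions, so every class
# appears in both the train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = XGBClassifier()
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
```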