DistrictDataLabs / yellowbrick

Visual analysis and diagnostic tools to facilitate machine learning model selection.
http://www.scikit-yb.org/
Apache License 2.0

XGBoost Fit: XGBoostError: value 0 for Parameter num_class should be greater equal to 1 #1190

Closed lambda-science closed 3 years ago

lambda-science commented 3 years ago

Describe the bug

When calling the .fit() method of the PrecisionRecallCurve class on an XGBoost multiclass classifier, it raises an error:

XGBoostError: value 0 for Parameter num_class should be greater equal to 1
num_class: Number of output class in the multi-class classification.

To Reproduce

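import numpy as np
import xgboost as xgb
from sklearn.base import clone
from yellowbrick.classifier import ROCAUC, PrecisionRecallCurve

# Train an XGBClassifier with your data and the parameters found by optimization.
classes = np.unique(y_train)
est = xgb.XGBClassifier()
clf = clone(est).set_params(**best_trial.params)
model = clf.fit(x_train, y_train)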

# Prediction evaluation
y_pred = clf.predict(x_test)

# ROC curve from Yellowbrick (working)
visualizer = ROCAUC(model, classes=classes)
visualizer.fit(x_train, y_train)  
visualizer.score(x_test, y_test) 

# Precision-recall curve from Yellowbrick (error)
viz = PrecisionRecallCurve(model, classes=classes)
viz.fit(x_train, y_train)
viz.score(x_test, y_test)

Dataset

I use my own dataset, and it is not the issue, as it works with 9+ other ML methods.

Expected behavior

I expect num_class to be set automatically, as is supposed to happen when calling .fit() on an XGBClassifier.

Traceback

The Yellowbrick code for the precision-recall curve appears as prcurve.py because I extracted it to work on it after the error, without success.

---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
~/scikit_ML_Pipeline_Binary_Notebook/modeling_methods.py in run_XGB_full(x_train, y_train, x_test, y_test, randSeed, i, param_grid, name_path, hype_cv, n_trials, scoring_metric, timeout, wd_path, output_folder, algorithm, data_name, type_average)
   1516     # PRECISION RECALL - For each class
   1517     viz = PrecisionRecallCurve(model, classes=classes)
-> 1518     viz.fit(x_train, y_train)
   1519     viz.score(x_test, y_test)
   1520     prec = viz.precision_["micro"]

~/scikit_ML_Pipeline_Binary_Notebook/prcurve.py in fit(self, X, y)
    286 
    287         # Fit the model and return self
--> 288         return super(PrecisionRecallCurve, self).fit(X, Y)
    289 
    290     def score(self, X, y):

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/yellowbrick/classifier/base.py in fit(self, X, y, **kwargs)
    174         """
    175         # Super fits the wrapped estimator
--> 176         super(ClassificationScoreVisualizer, self).fit(X, y, **kwargs)
    177 
    178         # Extract the classes and the class counts from the target

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/yellowbrick/base.py in fit(self, X, y, **kwargs)
    388         """
    389         if not check_fitted(self.estimator, is_fitted_by=self.is_fitted):
--> 390             self.estimator.fit(X, y, **kwargs)
    391         return self
    392 

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/sklearn/multiclass.py in fit(self, X, y)
    279         # n_jobs > 1 in can results in slower performance due to the overhead
    280         # of spawning threads.  See joblib issue #112.
--> 281         self.estimators_ = Parallel(n_jobs=self.n_jobs)(delayed(_fit_binary)(
    282             self.estimator, X, column, classes=[
    283                 "not %s" % self.label_binarizer_.classes_[i],

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/joblib/parallel.py in __call__(self, iterable)
   1039             # remaining jobs.
   1040             self._iterating = False
-> 1041             if self.dispatch_one_batch(iterator):
   1042                 self._iterating = self._original_iterator is not None
   1043 

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    857                 return False
    858             else:
--> 859                 self._dispatch(tasks)
    860                 return True
    861 

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/joblib/parallel.py in _dispatch(self, batch)
    775         with self._lock:
    776             job_idx = len(self._jobs)
--> 777             job = self._backend.apply_async(batch, callback=cb)
    778             # A job can complete so quickly than its callback is
    779             # called before we get here, causing self._jobs to

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    206     def apply_async(self, func, callback=None):
    207         """Schedule a func to be run"""
--> 208         result = ImmediateResult(func)
    209         if callback:
    210             callback(result)

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    570         # Don't delay the application, to avoid keeping the input
    571         # arguments in memory
--> 572         self.results = batch()
    573 
    574     def get(self):

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/joblib/parallel.py in __call__(self)
    260         # change the default number of processes to -1
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262             return [func(*args, **kwargs)
    263                     for func, args, kwargs in self.items]
    264 

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/joblib/parallel.py in <listcomp>(.0)
    260         # change the default number of processes to -1
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262             return [func(*args, **kwargs)
    263                     for func, args, kwargs in self.items]
    264 

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/sklearn/utils/fixes.py in __call__(self, *args, **kwargs)
    220     def __call__(self, *args, **kwargs):
    221         with config_context(**self.config):
--> 222             return self.function(*args, **kwargs)

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/sklearn/multiclass.py in _fit_binary(estimator, X, y, classes)
     83     else:
     84         estimator = clone(estimator)
---> 85         estimator.fit(X, y)
     86     return estimator
     87 

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/xgboost/core.py in inner_f(*args, **kwargs)
    431         for k, arg in zip(sig.parameters, args):
    432             kwargs[k] = arg
--> 433         return f(**kwargs)
    434 
    435     return inner_f

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/xgboost/sklearn.py in fit(self, X, y, sample_weight, base_margin, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, base_margin_eval_set, feature_weights, callbacks)
   1174         )
   1175 
-> 1176         self._Booster = train(
   1177             params,
   1178             train_dmatrix,

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/xgboost/training.py in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, xgb_model, callbacks)
    187     Booster : a trained booster model
    188     """
--> 189     bst = _train_internal(params, dtrain,
    190                           num_boost_round=num_boost_round,
    191                           evals=evals,

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/xgboost/training.py in _train_internal(params, dtrain, num_boost_round, evals, obj, feval, xgb_model, callbacks, evals_result, maximize, verbose_eval, early_stopping_rounds)
     79         if callbacks.before_iteration(bst, i, dtrain, evals):
     80             break
---> 81         bst.update(dtrain, i, obj)
     82         if callbacks.after_iteration(bst, i, dtrain, evals):
     83             break

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/xgboost/core.py in update(self, dtrain, iteration, fobj)
   1494 
   1495         if fobj is None:
-> 1496             _check_call(_LIB.XGBoosterUpdateOneIter(self.handle,
   1497                                                     ctypes.c_int(iteration),
   1498                                                     dtrain.handle))

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/xgboost/core.py in _check_call(ret)
    208     """
    209     if ret != 0:
--> 210         raise XGBoostError(py_str(_LIB.XGBGetLastError()))
    211 
    212 

XGBoostError: value 0 for Parameter num_class should be greater equal to 1
num_class: Number of output class in the multi-class classification.

Additional context

https://stackoverflow.com/questions/40116215/xgboost-sklearn-wrapper-value-0for-parameter-num-class-should-be-greater-equal-t

As per the Stack Overflow link, XGBoost is supposed to set this parameter automatically. This is not the case. I spent hours and hours trying to find a workaround: setting it by hand beforehand, setting it in the .fit() method, trying to skip the .fit() method since my classifier is already trained... Nothing works. I'm kind of depressed. Has anyone used the Yellowbrick PrecisionRecallCurve with XGBoost? It seems weird that the ROCAUC curve does not throw any errors.

pdamodaran commented 3 years ago

Hi @Aperture77 - thank you for using Yellowbrick! Sorry you have been having trouble with the PrecisionRecallCurve visualizer. In looking at the error message, it appears that the model is getting fitted again and it is possible that the error is occurring because of this. Since you already fitted the model, pass in is_fitted=True to the visualizer and hopefully this will resolve your issue.

lambda-science commented 3 years ago

Hello, thanks for the answer. Here is the traceback with is_fitted=True.

---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
/enadisk/maison/genomics18/xxxx/code-project/scikit_ML_Pipeline_Binary_Notebook/modeling_methods.py in run_XGB_full(x_train, y_train, x_test, y_test, randSeed, i, param_grid, name_path, hype_cv, n_trials, scoring_metric, timeout, wd_path, output_folder, algorithm, data_name, type_average)
   1518     viz = PrecisionRecallCurve(model, classes=classes, is_fitted=True)
   1519     viz.fit(x_train, y_train)
-> 1520     viz.score(x_test, y_test)
   1521     prec = viz.precision_["micro"]
   1522     recall = viz.recall_["micro"]

/enadisk/maison/genomics18/xxxx/code-project/scikit_ML_Pipeline_Binary_Notebook/prcurve.py in score(self, X, y)
    312         # Call super to check if fitted and to compute classes_
    313         # Note that self.score_ computed in super will be overridden below
--> 314         super(PrecisionRecallCurve, self).score(X, y)
    315 
    316         # Compute the prediction/threshold scores

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/yellowbrick/classifier/base.py in score(self, X, y)
    236 
    237         # This method implements ScoreVisualizer (do not call super).
--> 238         self.score_ = self.estimator.score(X, y)
    239         return self.score_
    240 

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
    498         """
    499         from .metrics import accuracy_score
--> 500         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
    501 
    502     def _more_tags(self):

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/sklearn/multiclass.py in predict(self, X)
    356             Predicted multi-class targets.
    357         """
--> 358         check_is_fitted(self)
    359 
    360         n_samples = _num_samples(X)

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

~/anaconda3/envs/ML-pipeline/lib/python3.9/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
   1096 
   1097     if not attrs:
-> 1098         raise NotFittedError(msg % {'name': type(estimator).__name__})
   1099 
   1100 

NotFittedError: This OneVsRestClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

With is_fitted=True, the PrecisionRecallCurve is trying to wrap the XGBClassifier in a OneVsRestClassifier that is not fitted.

From the .fit() doc:

    Fit the classification model; if ``y`` is multi-class, then the estimator
    is adapted with a ``OneVsRestClassifier`` strategy, otherwise the estimator
    is fit directly.
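
To make the docstring concrete, here is a minimal sketch of the one-vs-rest split using scikit-learn's label_binarize on a made-up target. That Yellowbrick binarizes the target the same way is an assumption suggested by the fit(X, Y) call in the first traceback; it has not been checked against the library source.

```python
import numpy as np
from sklearn.preprocessing import label_binarize

# Toy multi-class target with three classes (made-up values).
y = np.array([0, 2, 1, 2, 0])
classes = np.unique(y)

# One indicator column per class: column k is 1 where y == classes[k].
Y = label_binarize(y, classes=classes)

# Each column is the binary target that one one-vs-rest sub-estimator sees.
for k, cls in enumerate(classes):
    print(f"class {cls} vs rest: {Y[:, k].tolist()}")
```

Every sub-problem is binary, so each cloned estimator is fitted on only two labels regardless of how many classes the original target had.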

My classification is multiclass so it is expected. If I skip the .fit() method and directly score:


---------------------------------------------------------------------------
NotFitted                                 Traceback (most recent call last)
/enadisk/maison/genomics18/xxxx/code-project/scikit_ML_Pipeline_Binary_Notebook/modeling_methods.py in run_XGB_full(x_train, y_train, x_test, y_test, randSeed, i, param_grid, name_path, hype_cv, n_trials, scoring_metric, timeout, wd_path, output_folder, algorithm, data_name, type_average)
   1518     viz = PrecisionRecallCurve(model, classes=classes, is_fitted=True)
   1519     # viz.fit(x_train, y_train)
-> 1520     viz.score(x_test, y_test)
   1521     prec = viz.precision_["micro"]
   1522     recall = viz.recall_["micro"]

/enadisk/maison/genomics18/xxxx/code-project/scikit_ML_Pipeline_Binary_Notebook/prcurve.py in score(self, X, y)
    303         # has not correctly been fitted for multi-class targets.
    304         if not hasattr(self, "target_type_"):
--> 305             raise NotFitted.from_estimator(self, "score")
    306 
    307         # Must perform label binarization before calling super

NotFitted: this PrecisionRecallCurve instance is not fitted yet, please call fit with the appropriate arguments before using score


This happens because the "target_type_" attribute, which is set in the fit() method, is never set.

pdamodaran commented 3 years ago

We discovered that there is an issue with the is_fitted parameter in the PrecisionRecallCurve visualizer: the estimator is wrapped in a OneVsRestClassifier, which is then never fitted. We have logged an issue for this and will be looking into it.
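
The effect can be reproduced with scikit-learn alone. The sketch below is illustrative only: it uses LogisticRegression and synthetic data as stand-ins for the XGBClassifier and dataset in this report, and can_predict is a hypothetical helper, not part of any library. It shows that wrapping an already fitted estimator in a OneVsRestClassifier yields a wrapper that scikit-learn still treats as unfitted.

```python
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier


def can_predict(estimator, X):
    """Hypothetical helper: True if predict() works, False on NotFittedError."""
    try:
        estimator.predict(X)
        return True
    except NotFittedError:
        return False


# Synthetic three-class data (illustrative only).
X, y = make_classification(
    n_samples=120, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# The inner model is fitted, as in the report above.
inner = LogisticRegression(max_iter=1000).fit(X, y)

# Wrapping creates a new, unfitted meta-estimator around it.
wrapper = OneVsRestClassifier(inner)
fitted_before = can_predict(wrapper, X)

# Fitting the wrapper clones and refits the inner model once per class.
wrapper.fit(X, y)
fitted_after = can_predict(wrapper, X)

print(fitted_before, fitted_after)
```

The wrapper only becomes usable after its own fit() call, which matches the NotFittedError raised for the OneVsRestClassifier in the traceback above.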

In the meantime, as the article you posted points out, the num_class parameter is not automatically set as scikit-learn uses the cv method for the OneVsRestClassifier. You can update your code with the following:


# Train an XGBClassifier with your data and parameters after optimization.
classes = np.unique(y_train)
est = xgb.XGBClassifier(num_class=len(classes))