dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

RandomForest in XGBoost #6608

Open sbushmanov opened 3 years ago

sbushmanov commented 3 years ago

The Caveats section of the docs on running random forests in XGBoost says:

XGBoost uses 2nd order approximation to the objective function. This can lead to results that differ from a random forest implementation that uses the exact value of the objective function.
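For reference, the approximation the caveat refers to is the second-order Taylor expansion of the loss that XGBoost optimizes at each boosting round (as described in the "Introduction to Boosted Trees" tutorial):

\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \Bigl[\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Bigr] + \Omega(f_t), \qquad g_i = \partial_{\hat{y}_i^{(t-1)}} l, \quad h_i = \partial^2_{\hat{y}_i^{(t-1)}} l

where f_t is the tree added at round t; a classical random forest instead grows each tree against the exact value of its splitting criterion.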

It seems to me that when you do, e.g.:

from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=300,
    max_depth=3,
    objective="binary:logistic",
    eval_metric="logloss",
    use_label_encoder=False,
)
xgb
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
              gamma=0, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=3, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=300, n_jobs=12,
              num_parallel_tree=1, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)

the second-order approximation is not the main source of the difference from sklearn's results. The main difference seems to be the objective function: "gini" for sklearn and "logloss" (?) for XGBoost (please correct me if I am wrong).
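For the binary case, with p the predicted positive-class probability, the two criteria being contrasted are the Gini impurity minimized by sklearn's split selection and the logloss optimized by binary:logistic:

G(p) = 2\,p\,(1-p), \qquad L(y, p) = -\bigl[\, y \log p + (1 - y) \log(1 - p) \,\bigr]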

And it is the choice of objective function, not the order of approximation, that affects the probability calibration curves:

[two calibration-curve plots comparing sklearn's RandomForest with XGBoost]

with the calibration curves for XGBoost with booster="gbtree" being (as expected) perfectly calibrated on the bigger datasets.
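For reproducibility, here is a minimal sketch of how such calibration curves can be generated; the make_classification dataset is a stand-in assumption, not the data behind the plots above:

from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
import matplotlib.pyplot as plt

# Synthetic stand-in data (assumption); vary n_samples to see the
# dataset-size effect described above.
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "sklearn RandomForest": RandomForestClassifier(
        n_estimators=300, max_depth=3, random_state=0
    ),
    "XGBoost": XGBClassifier(  # same parameters as the snippet above
        n_estimators=300, max_depth=3, objective="binary:logistic",
        eval_metric="logloss", use_label_encoder=False,
    ),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    # Fraction of positives vs. mean predicted probability per bin
    frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
    plt.plot(mean_pred, frac_pos, marker="o", label=name)

plt.plot([0, 1], [0, 1], "k--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()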

So my proposal here is to add this to the docs (assuming I am right).

hcho3 commented 3 years ago

That's correct; the description should be reworded to point out that the gini criterion of random forests is different from the logloss objective used in XGBoost.
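For completeness, the documented way to run an actual random forest through the sklearn wrapper is XGBRFClassifier, which fixes learning_rate=1 and grows all trees in a single boosting round; a minimal sketch reusing the parameters from the snippet above:

from xgboost import XGBRFClassifier

# XGBRFClassifier maps n_estimators to num_parallel_tree and trains a
# single boosting round, so all 300 trees form one forest; it also
# defaults to row/column subsampling (subsample=0.8, colsample_bynode=0.8).
rf = XGBRFClassifier(
    n_estimators=300,
    max_depth=3,
    objective="binary:logistic",
    eval_metric="logloss",
    use_label_encoder=False,
)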