dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

How can Random Forest outperform XGBoost? #10827

Open rohan472000 opened 1 month ago

rohan472000 commented 1 month ago

Hi Viewer,

I am performing predictions using both XGBoost and Random Forest models on a dataset, but I consistently observe that the Random Forest model achieves better scores and correlation values compared to XGBoost, even though I am using extensive hyperparameter tuning for both models. Below are the hyperparameter grids I am using for tuning:

        # XGBoost parameter grid
        param_grid_xgb = {
            'n_estimators': [100, 200, 300, 350, 400, 500],
            'max_depth': [13, 15, 17, 27, 30, 35],
            'learning_rate': [0.01, 0.1, 0.2, 0.3, 0.001, 0.5],
        }

        # Random Forest parameter grid
        param_grid_rf = {
            'n_estimators': [100, 200, 300, 350, 400, 500],
            'max_depth': [13, 15, 17, 27, 30, 35],
            'min_samples_split': [12, 15, 10, 22, 25, 35],
            'min_samples_leaf': [11, 12, 14, 25, 30, 35],
            'max_features': ['sqrt', 'log2'],
        }

Despite trying different combinations of hyperparameters, the Random Forest model consistently outperforms the XGBoost model in terms of R² score and correlation.

To improve performance, I attempted to use a stacking regressor ensemble combining the two models (Random Forest and XGBoost). However, surprisingly, the ensemble's results fall between the two: better than XGBoost but worse than Random Forest.

My question is: how can the Random Forest model consistently outperform XGBoost here, despite extensive hyperparameter tuning for both models?

Any insights or suggestions would be greatly appreciated. Thank you in advance for your help!

trivialfis commented 1 month ago

I don't think one can provide a definite answer to why one model performs better than another. If you want to do more experiments, you can try the num_parallel_tree parameter in combination with subsample to achieve a form of boosted random forest.
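
For what it's worth, a minimal sketch of that combination using the sklearn wrapper; the specific values below are placeholders, not recommendations:

    # Minimal sketch of a "boosted random forest": each boosting round grows
    # num_parallel_tree trees on a row/column subsample, so several rounds
    # produce an ensemble of small forests. Values below are placeholders.
    from xgboost import XGBRegressor

    model = XGBRegressor(
        n_estimators=100,       # boosting rounds; each round adds a small forest
        num_parallel_tree=10,   # trees grown per boosting round
        subsample=0.8,          # row subsampling per tree
        colsample_bynode=0.8,   # column subsampling per split, random-forest style
        learning_rate=0.1,
        max_depth=6,
    )
    # model.fit(X_train, y_train)  # X_train / y_train stand in for your own data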

6nc0r6-1mp6r0 commented 1 month ago

Hi @rohan472000,

First, I would suggest putting the hyperparameter grids aside for now and instead building one boosted tree at a time until you get a sense of how best to configure the XGBoost hyperparameters for your dataset. In your manual optimization, you could try the following if you have not already:

(1) Specify an eval_metric and plot it for both the training dataset and testing dataset. Assuming this metric is a loss function (i.e., better when smaller), you should stop the training when the eval_metric for the testing dataset is at or near its global minimum, or at least is not significantly improving. If you are stopping training too early or too late, the prediction accuracy of XGBoost may be poor. For example, the optimal value of n_estimators may be much less than 100 or much greater than 500, causing you to miss it in your grid search.

(2) If the eval_metric for the testing dataset has a global minimum, you may be able to achieve (1) more easily and accurately by implementing early stopping with a callback structure. Just look it up if you are interested; a rough sketch is included after these suggestions.

(3) Try smaller values of max_depth. Counterintuitively, large values of max_depth, and more generally large trees, sometimes produce less accurate predictions, perhaps because they over-split the data. You can also try setting other parameters that limit tree size and structure: min_child_weight, min_split_loss, max_leaves, and so on. Be sure to make them restrictive enough that they become the size-limiting constraint.

(4) You can use plot_importance to find the important features and then make a new model trained with only these. The paper "Why do tree-based models still outperform deep learning on tabular data?" by Grinsztajn, Oyallon, and Varoquaux shows that irrelevant features often reduce prediction accuracy, so eliminating these features often raises it.

(5) You can try various schemes of subsampling, bootstrapping, and bagging to increase the robustness of the predictions. As @trivialfis notes, these can include the construction of a random forest within XGBoost itself.
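
If it helps, here is a minimal sketch of (1), (2), and (4) combined, assuming the sklearn wrapper and a reasonably recent xgboost (1.6+, where eval_metric and early_stopping_rounds are constructor arguments); the synthetic dataset and all numbers are placeholders:

    # Sketch of suggestions (1), (2), and (4); placeholder data stands in for
    # the real dataset.
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from xgboost import XGBRegressor, plot_importance

    X, y = make_regression(n_samples=4000, n_features=20, noise=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = XGBRegressor(
        n_estimators=2000,          # set high; early stopping picks the real number
        learning_rate=0.05,
        max_depth=3,                # start small, per suggestion (3)
        eval_metric="rmse",         # (1): metric tracked on both datasets
        early_stopping_rounds=50,   # (2): stop once the test metric stops improving
    )
    model.fit(
        X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        verbose=False,
    )

    # (1): plot train vs. test curves to see where the test metric bottoms out
    history = model.evals_result()
    plt.plot(history["validation_0"]["rmse"], label="train")
    plt.plot(history["validation_1"]["rmse"], label="test")
    plt.legend()
    plt.show()

    # (4): inspect feature importances; consider retraining on the top features only
    plot_importance(model)
    plt.show()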

I have not studied the stacking regressor you mention, but it is entirely possible for stacking to yield a less accurate model than one of the combined models. Think of a single prediction: if both predicted values are larger than the true value, then their mean will be between them in value and less accurate than the more accurate one. In practice, stacking often yields more accurate predictions, but without more assumptions this is not guaranteed!
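
A toy numeric illustration of that point (hypothetical numbers):

    # Averaging two over-predictions lands between them, so on this example the
    # blend is worse than the better model but better than the worse one.
    y_true = 10.0
    pred_rf, pred_xgb = 12.0, 14.0    # hypothetical predictions, both too high
    blend = (pred_rf + pred_xgb) / 2  # 13.0
    print(abs(pred_rf - y_true), abs(blend - y_true), abs(pred_xgb - y_true))  # 2.0 3.0 4.0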

Finally, RandomForest does appear to legitimately outperform a well-configured single-tree XGBoost implementation for some datasets. Examples of this can be found in the paper by Grinsztajn, et al. mentioned above.

Hope this helps.

rohan472000 commented 1 month ago

Hi @6nc0r6-1mp6r0, Thank you so much for your insights and suggestions! I appreciate your help and will implement them. Regarding the stacking regressor, I appreciate your perspective. It's insightful to consider how the combined predictions might not always lead to better accuracy, especially when both models predict values that are consistently off. I was thinking the same but wasn't very confident about it.

rohan472000 commented 1 month ago

Hi @trivialfis, Thanks for the valuable suggestions; I will add them to my code.

gdubs89 commented 1 month ago

This was partly covered in a previous answer, but unless your data is absolutely massive, those max_depths are likely to be far from optimal for xgboost.

Also important to note: if you find that max_depth=X is optimal for a random forest, the optimal max_depth for XGBoost on the same task will be considerably lower. So while you could run your random forest and XGBoost over the same grid of max_depth values, it would need to be quite a big grid. It depends a little on the size of your data, but I'd probably search over max_depth=1-6 if your data is fairly small, and shift that to 2-7, 3-8, etc. as it gets bigger (not an exact science). Conversely, for the random forest, if I were going to search over 6 values of max_depth, I'd probably search over 6, 8, 10, 12, 14, 16 (again, depending on the data; you can do a rough calculation to figure out at what depth you're likely to bottom out).

The intuition here is that a random forest intentionally overfits: every individual decision tree overfits in different ways, and those errors partly cancel out (although you'll still often find there is some value in not choosing the largest max depth you possibly can).

XGBoost, by contrast, intentionally underfits every individual tree and then fits lots of them sequentially to slowly get the loss down that way.

Note that, as others have said, it's still possible that even with optimal hyperparameters, random forest will perform better on your particular problem; it's been known to happen. There's no theoretical reason why XGBoost must outperform random forest, although in my experience it almost always does.

Other notes: random forests (assuming you're using sklearn) let you control complexity via other parameters like min_samples_leaf, which I'd suggest is a better way of doing it, since it allows for asymmetric trees.

max_depth=35 sounds absurdly high to me, unless your data is absolutely massive (like 1 trillion rows). Why? 2^35 is about 34 billion, so if a tree grew down to depth 35 it could have as many as 34 billion terminal nodes, and you'd generally want more than one datapoint per terminal node. As a good rule of thumb, log2(datasize) is an upper bound on the largest max_depth worth probing.
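
The arithmetic behind that rule of thumb, with a placeholder row count:

    # A binary tree of depth d has at most 2**d leaves, so depths beyond
    # ~log2(n_rows) cannot be fully used.
    import math

    n_rows = 4_000                         # placeholder dataset size
    print(2 ** 35)                         # 34359738368 -> ~34 billion leaves at depth 35
    print(math.floor(math.log2(n_rows)))   # 11 -> little point probing max_depth far beyond this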

rohan472000 commented 1 month ago

Hi @gdubs89, currently my data is limited to only about 4000 rows.

gdubs89 commented 1 month ago

4000 is roughly 2^12, so it's unlikely that a depth in excess of ~12 will have any effect (that is to say, max_depth=13, 14, 15, ..., 35 will all create the same tree).

For such small datasets, I suspect XGBoost will perform best at max_depth=1 or 2.

A generally instructive exercise is to first grid-search over the max_depth parameter for a single decision tree under cross-validation. My guess is you'll find the optimal max_depth will be around 4-8.

Then do the same thing for a random forest (you can have plenty of trees in the forest, as your data is small). You'll find the optimal max_depth to be deeper than the decision tree's, and you'll find the optimal performance is better.

Finally, do the same thing for XGBoost (make sure to use a validation set and early stopping). You'll find the optimal max_depth to be lower than the decision tree's (as I said above, I'm guessing 1 or 2), and you'll also find the optimal performance is better than that of the decision tree, and probably better than that of the random forest (a rough sketch is at the end of this comment).

This nicely demonstrates the fundamental difference between boosting and bagging.

That said, RandomForest allows you to control complexity via the min_samples_leaf parameter, so you could probably still eke out a bit of performance by regularising your random forest that way rather than via max_depth.
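
A rough sketch of that exercise, not a definitive recipe: the dataset is synthetic and the grids, estimator counts, and scoring are placeholders; in a real run you'd also give XGBoost a held-out validation set with early stopping rather than relying on a plain CV grid.

    # Cross-validated max_depth search for a single tree, a random forest,
    # and XGBoost on a synthetic regression problem (placeholders throughout).
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeRegressor
    from xgboost import XGBRegressor

    X, y = make_regression(n_samples=4000, n_features=20, noise=0.1, random_state=0)

    # 1) Single decision tree: expect a fairly shallow optimum (roughly 4-8).
    dt = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      {"max_depth": list(range(1, 17))}, cv=5, scoring="r2").fit(X, y)

    # 2) Random forest: expect a deeper optimum and a better score than the tree.
    rf = GridSearchCV(RandomForestRegressor(n_estimators=200, random_state=0),
                      {"max_depth": [4, 6, 8, 10, 12, 14, 16]}, cv=5, scoring="r2").fit(X, y)

    # 3) XGBoost: expect a much shallower optimum (often max_depth=1-3 on small data).
    xgb = GridSearchCV(XGBRegressor(n_estimators=300, learning_rate=0.05, random_state=0),
                       {"max_depth": [1, 2, 3, 4, 5, 6]}, cv=5, scoring="r2").fit(X, y)

    for name, search in [("decision tree", dt), ("random forest", rf), ("xgboost", xgb)]:
        print(name, search.best_params_, round(search.best_score_, 3))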