rohan472000 opened 1 month ago
I don't think one can provide a definite answer to why one model performs better than another. If you want to do more experiments, you can try the `num_parallel_tree` parameter in combination with `subsample` to achieve a form of boosted random forest.
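As a rough illustration of that suggestion, here is a minimal sketch of a "boosted random forest" configuration; the parameter values and the `X_train`/`y_train` names are placeholders, not recommendations:

```python
import xgboost as xgb

# Assumed placeholders: X_train, y_train already exist.
# num_parallel_tree grows a small forest at each boosting round, while
# subsample / colsample_bynode add the row and feature randomness of a
# random forest. All values below are illustrative only.
model = xgb.XGBRegressor(
    n_estimators=100,        # boosting rounds
    num_parallel_tree=4,     # trees grown in parallel per round
    subsample=0.8,           # row subsampling per tree
    colsample_bynode=0.8,    # feature subsampling per split
    learning_rate=0.1,
    max_depth=4,
)
model.fit(X_train, y_train)
```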
Hi @rohan472000,
First, I would suggest putting the hyperparameter grids aside for now and instead building one boosted tree at a time until you get a sense of how best to configure the XGBoost hyperparameters for your dataset. In your manual optimization, you could try the following if you have not already:
(1) Specify an `eval_metric` and plot it for both the training dataset and testing dataset. Assuming this metric is a loss function (i.e., better when smaller), you should stop the training when the `eval_metric` for the testing dataset is at or near its global minimum, or at least is not significantly improving. If you are stopping training too early or too late, the prediction accuracy of XGBoost may be poor. For example, the optimal value of `n_estimators` may be much less than 100 or much greater than 500, causing you to miss it in your grid search.
(2) If the `eval_metric` for the testing dataset has a global minimum, you may be able to achieve (1) more easily and accurately by implementing early stopping with a callback structure. Just look it up if you are interested; a rough sketch appears after this list.
(3) Try smaller values of `max_depth`. Counterintuitively, large values of `max_depth`, and more generally large trees, sometimes produce less accurate predictions, perhaps because they over-split the data. You can also try setting other parameters that limit tree size and structure: `min_child_weight`, `min_split_loss`, `max_leaves`, and so on. Be sure to make them restrictive enough that they become the size-limiting constraint.
(4) You can use `plot_importance` to find the important features and then train a new model with only these (see the second sketch after this list). The paper "Why do tree-based models still outperform deep learning on tabular data?" by Grinsztajn, Oyallon, and Varoquaux shows that irrelevant features often reduce prediction accuracy, so eliminating these features often raises it.
(5) You can try various schemes of subsampling, bootstrapping, and bagging to increase the robustness of the predictions. As @trivialfis notes, these can include the construction of a random forest within XGBoost itself.
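For (1) and (2), here is a minimal sketch of what the `eval_metric` plot plus an early-stopping callback could look like; it assumes a reasonably recent XGBoost (1.6+) where `eval_metric` and `callbacks` are passed to the constructor, and `X_train`/`X_test`/`y_train`/`y_test` are placeholders:

```python
import matplotlib.pyplot as plt
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=2000,        # deliberately large; early stopping picks the cutoff
    learning_rate=0.05,
    max_depth=3,
    eval_metric="rmse",
    callbacks=[xgb.callback.EarlyStopping(rounds=50, save_best=True)],
)
model.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],  # last entry drives early stopping
    verbose=False,
)

# Plot the eval_metric for both datasets to see where the test curve bottoms out.
history = model.evals_result()
plt.plot(history["validation_0"]["rmse"], label="train")
plt.plot(history["validation_1"]["rmse"], label="test")
plt.xlabel("boosting round")
plt.ylabel("rmse")
plt.legend()
plt.show()
```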
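And for (4), a sketch of pruning to the most important features; the top-10 cutoff is arbitrary, and `X_train` is assumed to be a pandas DataFrame with named columns:

```python
import xgboost as xgb
from xgboost import plot_importance

# Visual check of which features matter (uses the model trained above).
plot_importance(model, max_num_features=15)

# Keep only the top-k features by gain and retrain; k=10 is an arbitrary choice.
scores = model.get_booster().get_score(importance_type="gain")
top_features = sorted(scores, key=scores.get, reverse=True)[:10]
model_top = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=3)
model_top.fit(X_train[top_features], y_train)
```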
I have not studied the stacking regressor you mention, but it is entirely possible for stacking to yield a less accurate model than one of the combined models. Think of a single prediction: if both predicted values are larger than the true value, then their mean lies between them and is less accurate than the more accurate of the two (e.g., for a true value of 10 and predictions of 12 and 14, the mean of 13 is worse than 12 but better than 14). In practice, stacking often yields more accurate predictions, but without more assumptions this is not guaranteed!
Finally, RandomForest does appear to legitimately outperform a well-configured single-tree XGBoost implementation for some datasets. Examples of this can be found in the paper by Grinsztajn, et al. mentioned above.
Hope this helps.
Hi @6nc0r6-1mp6r0, Thank you so much for your insights and suggestions! I appreciate your help and will implement your suggestions. Regarding the stacking regressor, I appreciate your perspective. It's insightful to consider how the combined predictions might not always lead to better accuracy, especially when both models predict values that are consistently off. I was thinking the same but wasn't very confident about it.
Hi @trivialfis, Thanks for your valuable inputs; I will add them to my code.
This was partly covered in a previous answer, but unless your data is absolutely massive, those max_depths are likely to be far from optimal for xgboost.
Also important to note that if you find max_depth=X is optimal for a RandomForest, the optimal max_depth for XGBoost on the same task will be considerably lower. So while you could run your RandomForest and XGBoost over the same grid of MaxDepths, it would need to be quite a big grid. It depends a little on the size of your data, but I'd probably search over MaxDepth=1-6 if your data is fairly small and, the bigger it gets, start shifting that to 2-7, 3-8, etc. (not an exact science). Conversely, for the RandomForest, if I were going to search over 6 values of MaxDepth, I'd probably (again depending on the data; you can do a rough calculation to figure out at what depth you're likely to bottom out) search over 6, 8, 10, 12, 14, 16.
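To make that concrete, here is a minimal sketch of the two very different depth grids; the other settings and the `X_train`/`y_train` names are placeholders:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Shallow depths for XGBoost, much deeper ones for the random forest.
xgb_search = GridSearchCV(
    xgb.XGBRegressor(n_estimators=300, learning_rate=0.05),
    param_grid={"max_depth": [1, 2, 3, 4, 5, 6]},
    scoring="r2",
    cv=5,
)
rf_search = GridSearchCV(
    RandomForestRegressor(n_estimators=500),
    param_grid={"max_depth": [6, 8, 10, 12, 14, 16]},
    scoring="r2",
    cv=5,
)
xgb_search.fit(X_train, y_train)
rf_search.fit(X_train, y_train)
print("xgb:", xgb_search.best_params_, "rf:", rf_search.best_params_)
```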
The intuition here is that a random forest is intentionally trying to overfit, with every individual decision tree overfitting in different ways and this somewhat cancelling out (although you'll still often find that there's some value in not choosing the maximum depth you possibly can).
XGBoost is trying to intentionally underfit every individual tree and then fit lots of them sequentially to slowly get the loss down that way.
Note that, as others have said, it's still possible that even if you do find the optimal hyperparameters, RandomForest will perform better for your particular problem. It's been known to happen; there's no theoretical reason why XGBoost must outperform random forest, although in my experience it almost always does.
Other notes: RandomForests (assuming you're using sklearn) let you control complexity via other parameters like min_samples_leaf, which I'd suggest is a better way of doing it, as it allows for asymmetric trees.
MaxDepth=35 sounds absurdly high to me, unless your data is absolutely massive (like 1 trillion rows). Why? 2^35 is about 34 billion, so if a tree grew down to max depth 35 it could have as many as 34 billion terminal nodes, and you'd likely want more than 1 datapoint per terminal node. As a good rule of thumb, log2(datasize) is an upper bound on the maximum max depth you want to probe.
Hi @gdubs89, currently my data is limited to only 4000 rows.
4000 is roughly 2^12, so it's unlikely that depths in excess of ~12 will have any effect (that is to say, max_depth=13, 14, 15, ..., 35 will all create the same tree).
For such small datasets, I suspect xgboost will perform best at maxDepth=1 or 2.
A generally instructive exercise is to first gridsearch over the MaxDepth parameter for a decision tree under cross-validation (a rough sketch of the whole exercise is at the end of this post). My guess is you'll find the optimal max depth will be 4-8.
Then do the same thing for a random forest (you can have plenty of trees in the forest, as your data is small). You'll find the optimal max depth to be deeper than the decision tree's optimal max depth, and you'll find optimal performance is better.
Finally, do the same thing for XGBoost (make sure to use a validation set and early stopping). You'll find the optimal maxdepth to be lower (as I said above, I'm guessing 1 or 2) than the decision tree's optimal maxdepth, and you'll also find optimal performance is better than that of the decision tree, and probably better than that of the random forest.
This nicely demonstrates the fundamental difference between boosting and bagging
That said, RandomForest allows you to control complexity via the min_samples_leaf parameter, so you could probably still eke out a bit of performance by regularising your random forest in that way rather than via max depth.
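Here is a rough sketch of the exercise described above; `X` and `y` are placeholders, the depth ranges are illustrative, and for brevity the XGBoost step fixes `n_estimators` under plain cross-validation instead of using a validation set with early stopping as suggested:

```python
import numpy as np
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def best_depth(make_model, depths):
    # Return the depth with the best mean 5-fold CV R^2 for a given model constructor.
    scores = [cross_val_score(make_model(d), X, y, cv=5, scoring="r2").mean() for d in depths]
    best = int(np.argmax(scores))
    return depths[best], scores[best]

print("tree:  ", best_depth(lambda d: DecisionTreeRegressor(max_depth=d), range(1, 16)))
print("forest:", best_depth(lambda d: RandomForestRegressor(max_depth=d, n_estimators=500), range(1, 16)))
print("xgb:   ", best_depth(lambda d: xgb.XGBRegressor(max_depth=d, n_estimators=300, learning_rate=0.05), range(1, 8)))
```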
Hi Viewer,

I am performing predictions using both `XGBoost` and `Random Forest` models on a dataset, but I consistently observe that the Random Forest model achieves better `R²` scores and `correlation` values compared to `XGBoost`, even though I am using extensive hyperparameter tuning for both models. Below are the hyperparameter grids I am using for tuning:

Despite trying different combinations of hyperparameters, the Random Forest model consistently outperforms the `XGBoost` model in terms of `R² score` and `correlation`.

To improve performance, I attempted to use a `stacking regressor` ensemble combining the two models (`Random Forest` and `XGBoost`). However, surprisingly, the ensemble results are coming out lower than `Random Forest` and higher than `XGBoost`.

My questions are:

1. Why could `Random Forest` be performing better than `XGBoost` in this case? Could this be due to the specific nature of my dataset, model architecture, or the hyperparameters used?
2. Are there specific settings in `XGBoost` that I should consider tweaking to enhance its performance in comparison to `Random Forest`?
3. Why is the `stacking regressor` ensemble not showing better results? In theory, it should combine the strengths of both models and perform better, but that's not happening. Are there any common reasons or mistakes that could lead to this outcome?

Any insights or suggestions would be greatly appreciated. Thank you in advance for your help!