dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Poisson Loss in Random Forest Mode #5423

Open mayer79 opened 4 years ago

mayer79 commented 4 years ago

With MSE loss, the random forest mode seems to work well. However, when switching to the "count:poisson" (and also "reg:gamma") loss, the model is completely off: the distribution of the predictions is heavily biased.

In the example below, the R-squared (with respect to both MSE and Poisson deviance) drops from 70%-80% to about 0%. The max_delta_step parameter has a large impact.

set.seed(1)
n <- 1000
x1 <- seq_len(n)
x2 <- rnorm(n)
X <- cbind(x1, x2)
y <- rpois(n, x1 / 1000 + x2^2)

library(xgboost)
library(MetricsWeighted)

dtrain_xgb <- xgb.DMatrix(X, label = y)

#=======================================================
# WITH MSE loss
#=======================================================

# xgboost with random forest like parameters
param <- list(max_depth = 10,
              learning_rate = 1,
              objective = "reg:linear",
              subsample = 0.63,
              lambda = 0,
              alpha = 0,
              colsample_bylevel = 1/3)

fit_xgb <- xgb.train(param,
                     dtrain_xgb,
                     watchlist = list(train = dtrain_xgb),
                     nrounds = 1,
                     num_parallel_tree = 500)

pred_mse <- predict(fit_xgb, X)

summary(pred_mse)
# Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
# 0.01976  0.69959  1.06019  1.51380  1.74806 11.24593

# MSE-based R-squared
r_squared(y, pred_mse) # 0.8228559

# Poisson-based R-squared
r_squared_poisson(y, pred_mse) # 0.6900247

#=======================================================
# WITH Poisson loss
#=======================================================

# xgboost with random forest like parameters
param <- list(max_depth = 20,
              learning_rate = 1,
              objective = "count:poisson",
              subsample = 0.63,
              lambda = 0,
              alpha = 0,
             # max_delta_step = 0.7,
              colsample_bylevel = 1/3)

fit_xgb <- xgb.train(param,
                     dtrain_xgb,
                     watchlist = list(train = dtrain_xgb),
                     nrounds = 1,
                     num_parallel_tree = 500)

pred_poi <- predict(fit_xgb, X)

summary(pred_poi)
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.3325  0.6366  0.8874  0.8054  0.9949  1.0069 

# MSE-based R-squared
r_squared(y, pred_poi) # -0.02036563

# Poisson-based R-squared
r_squared_poisson(y, pred_poi) # 0.01646716

sessionInfo()
# R version 3.6.1 (2019-07-05)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 18362)
# 
# Matrix products: default
# 
# locale:
#   [1] LC_COLLATE=English_Switzerland.1252  LC_CTYPE=English_Switzerland.1252   
# [3] LC_MONETARY=English_Switzerland.1252 LC_NUMERIC=C                        
# [5] LC_TIME=English_Switzerland.1252    
# 
# attached base packages:
#   [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
#   [1] MetricsWeighted_0.5.0 xgboost_0.90.0.2     
# 
# loaded via a namespace (and not attached):
#   [1] compiler_3.6.1    magrittr_1.5      Matrix_1.2-17     tools_3.6.1       stringi_1.4.6    
# [6] grid_3.6.1        data.table_1.12.8 lattice_0.20-38  
trivialfis commented 4 years ago

Maybe max_delta_step has something to do with the convergence. I'm not sure whether the argument is similar to the convergence proof for the softmax function.
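
For concreteness, an editorial illustration (not from the thread), assuming xgboost's defaults of base_score = 0.5 and max_delta_step = 0.7 for count:poisson: the Poisson prediction is exp of the margin, the margin starts at log(0.5), and max_delta_step caps how much the leaves of a single round can add to it. That cap lines up with the ceiling visible in summary(pred_poi) above:

# Largest prediction reachable in one round if the combined leaf
# contribution is capped at max_delta_step = 0.7 (xgboost's default for count:poisson):
0.5 * exp(0.7)  # 1.006876 -- essentially the observed maximum of pred_poi (1.0069)
# The response y regularly exceeds this value, so the one-round forest is forced
# to underpredict, which would explain the biased prediction distribution.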

mayer79 commented 4 years ago

In another example with different data and parameters, the result is not too different from LightGBM's rf mode, but ironically worse than using ranger with MSE loss. The Poisson deviance should be low; r_squared is the proportion of deviance explained and should be high. These values are calculated on an independent validation data set. So in this particular second example, there does not seem to be a problem. But what could be the reason for the results in the first example?

#             pred_xgb   pred_lgb  pred_ranger
# deviance  1.05753011 1.04356883  1.02037717
# r_squared 0.04407729 0.05669717  0.07766058

The code is in https://github.com/mayer79/random_forest_benchmark/blob/master/r/poisson.R
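
(Editorial side note, not part of the thread: a minimal sketch of what "proportion of deviance explained" means here, assuming MetricsWeighted also exports deviance_poisson(); the thread itself only calls r_squared_poisson().)

# Deviance-based R-squared: one minus the ratio of the model's mean Poisson
# deviance to the mean Poisson deviance of an intercept-only model.
prop_deviance_explained <- function(y, pred) {
  1 - deviance_poisson(y, pred) / deviance_poisson(y, rep(mean(y), length(y)))
}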

mayer79 commented 4 years ago

With a slightly different data structure but otherwise quite similar parameters, I again get bad results (an R-squared of -165% on the validation data). Training finishes very fast compared to LightGBM (3 seconds instead of 10). This is with the third commit in the link above.

#            pred_xgb   pred_lgb pred_ranger
# deviance   3.303889 1.17156879  1.13913418
# r_squared -1.655372 0.05839744  0.08446549

Hmm.

RAMitchell commented 4 years ago

In random forest mode we can only perform one Newton step to minimise the objective function. In the case of squared error, a single step is sufficient, but for the Poisson objective it is not, which might be the source of the bias. I have thought about how to fix this, maybe by refreshing the trees in subsequent boosting steps.
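
A rough sketch of what "refreshing the trees" could look like with xgboost's existing refresh updater (process_type = "update", updater = "refresh"), which keeps the fitted tree structures and re-estimates their leaf values from the current gradient statistics; this is an editorial illustration of the idea, not a verified fix for the bias:

# Continue from the Poisson forest fit_xgb of the first post and run one
# update pass that only recomputes the leaf values.
refresh_param <- c(param, list(process_type = "update",
                               updater = "refresh",
                               refresh_leaf = 1))

fit_refreshed <- xgb.train(refresh_param,
                           dtrain_xgb,
                           nrounds = 1,             # one pass over the already fitted round
                           num_parallel_tree = 500,
                           xgb_model = fit_xgb)     # start from the fitted forest

summary(predict(fit_refreshed, X))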

mayer79 commented 4 years ago

This might indeed be the reason why the performance is not too good. But I guess LightGBM's rf mode would suffer from the same issue; however, its performance was not negative in any of the cases I tested. I will try more max_delta_step values to see whether it is just a poor parametrization on my part.
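
A minimal sketch of such a scan on the training data of the first example (the max_delta_step values are arbitrary illustrations, not recommendations):

for (mds in c(0.7, 1, 2, 5, 10)) {
  param_poi <- list(max_depth = 20,
                    learning_rate = 1,
                    objective = "count:poisson",
                    subsample = 0.63,
                    lambda = 0,
                    alpha = 0,
                    max_delta_step = mds,
                    colsample_bylevel = 1/3)
  fit <- xgb.train(param_poi, dtrain_xgb, nrounds = 1, num_parallel_tree = 500)
  cat("max_delta_step =", mds,
      " Poisson R-squared =", r_squared_poisson(y, predict(fit, X)), "\n")
}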