
Bug in `objective = "reg:pseudohubererror"` and `xgb.plot.tree()` #10988

DrJerryTAO commented 2 weeks ago

Hi @mattn, I wanted to use XGBoost for quantile regression but found that the pseudo-Huber error objective does no better than a null model. Currently, objective = 'reg:pseudohubererror' predicts 0.5 for every case, learning nothing at all no matter how the other parameters are specified.

Also, xgb.plot.tree() shows nothing. The Viewer panel is blank.

Further, objective = "reg:quantileerror" raises an error even though the online documentation lists it (https://xgboost.readthedocs.io/en/latest/parameter.html). I am using the latest R package, version 1.7.8.1.

library(xgboost)
library(tidyverse)
data(mtcars)
Data <- mtcars %>%
  {xgb.DMatrix(
    data = (.) %>% select(-mpg) %>% as.matrix(), 
    label = (.) %>% pull(mpg))}
Model <- xgboost(
  data = Data, 
  objective = "reg:pseudohubererror", 
  max.depth = 3, eta = 1, nrounds = 100)
"As the log shows, each mean pseudo Hubber error is 18.618537, no changes 
over iteration"
Model <- xgboost(
  data = Data, 
  objective = "reg:pseudohubererror", eval_metric = "mae", 
  max.depth = 3, eta = 1, nrounds = 100)
"mae = 19.590625, no changes over 100 iteration"
mean(abs(mtcars$mpg - 0.5)) # 19.59062
"objective = 'reg:pseudohubererror' predicts every case as 0.5, 
no information learnt at all."
Model <- xgboost(
  data = Data, 
  objective = "reg:quantileerror", eval_metric = "mae", 
  max.depth = 3, eta = 1, nrounds = 100)
"Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) : 
  [02:01:16] src/objective/objective.cc:26: 
  Unknown objective function: `reg:quantileerror`
Objective candidate: survival:aft
Objective candidate: binary:hinge
Objective candidate: rank:pairwise
Objective candidate: rank:ndcg
Objective candidate: rank:map
Objective candidate: multi:softmax
Objective candidate: multi:softprob
Objective candidate: reg:squarederror
Objective candidate: reg:squaredlogerror
Objective candidate: reg:logistic
Objective candidate: binary:logistic
Objective candidate: binary:logitraw
Objective candidate: reg:linear
Objective candidate: reg:pseudohubererror
Objective candidate: count:poisson
Objective candidate: survival:cox
Objective candidate: reg:gamma
Objective candidate: reg:tweedie
Objective candidate: reg:absoluteerror"
Model <- xgboost(
  data = Data, 
  objective = "reg:tweedie", eval_metric = "mae", 
  max.depth = 3, eta = 1, nrounds = 4)
xgb.plot.tree(model = Model)
"The Viewer panel shows blank. This is not because my environment has errors."
xgb.plot.importance(importance_matrix = xgb.importance(model = Model))
"If I plot variable importance, I do see a plot in Plots."
hcho3 commented 2 weeks ago

The "reg:quantileerror" objective was added in XGBoost 2.0, which isn't available on CRAN. You should install the R package from the source to use the feature.

DrJerryTAO commented 2 weeks ago

@hcho3 thanks. @kashif @darxriggs Do you know why objective = 'reg:pseudohubererror' does not update over iterations? Could you address the xgb.plot.tree() and objective = 'reg:pseudohubererror' bugs? How do we install from source? I did not see sample code for R.

DrJerryTAO commented 1 week ago

Hi all, I have found the key: the base score. Switching the initial prediction from the default 0.5 to a weakly informative mean() or median() of the label lets the gradient search move away from the starting point. This runs against my intuition that a prediction far from the observations should generate large gradients toward the direction that lowers the loss, and the pseudo-Huber loss function is not flat. Unlike in other models, base_score appears to have a huge impact here even in small data sets.
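
My rough reading of why training stalls (a sketch, assuming the standard pseudo-Huber definition L(r) = d^2 * (sqrt(1 + (r/d)^2) - 1) with the default slope d = 1; I have not checked the XGBoost source): the gradient saturates at magnitude 1 while the hessian goes to zero for large residuals, so starting from base_score = 0.5 the hessian sums fall below the default min_child_weight = 1 and no split is ever accepted.

r <- 0.5 - mtcars$mpg    # residuals (prediction - label) at base_score = 0.5
g <- r / sqrt(1 + r^2)   # gradient: saturates near -1 for every case
h <- (1 + r^2)^(-3/2)    # hessian: nearly 0 whenever |r| >> 1
sum(h)                   # tiny, well below the default min_child_weight = 1

If that reading is right, every boosting round produces a tree with no splits, which matches the constant 0.5 prediction.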

Will the new "intercept" support (https://xgboost.readthedocs.io/en/latest/tutorials/intercept.html) in version 2.0 solve this problem automatically? I think it is also important to document that the default base_score = 0.5 works very poorly with objective = "reg:pseudohubererror".

See the impact when base_score = median() is specified:

# Solution: set base_score
library(xgboost)
library(tidyverse)
data(mtcars)
Data <- mtcars %>%
  {xgb.DMatrix(
    data = (.) %>% select(-mpg) %>% as.matrix(), 
    label = (.) %>% pull(mpg))}
Model <- xgboost(
  data = Data, 
  objective = "reg:pseudohubererror", 
  base_score = median(mtcars$mpg), 
  max.depth = 3, eta = 1, nrounds = 100)
"[1]    train-mphe:2.019801 
[100]   train-mphe:0.000000 "
Model <- xgboost(
  data = Data, 
  objective = "reg:pseudohubererror", eval_metric = "mae", 
  base_score = median(mtcars$mpg), 
  max.depth = 3, eta = 1, nrounds = 100)
"[1]    train-mae:2.685595
[100]   train-mae:0.000482"
trivialfis commented 1 week ago

Thank you for sharing; we will have to do some experiments once the R interface is ready. The latest XGBoost uses the median by default, so I suspect it should work.