gbm-developers / gbm3

Gradient boosted models

Wrong predictions with quantile distribution #149

Open khotilov opened 6 years ago

khotilov commented 6 years ago

I tried switching from gbm to gbm3 for quantile regression, but gbm3 returns some incorrect quantile predictions. The example below reproduces the problem using the attached toy data (X.zip).

library(data.table)

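# Return the first tree of a gbm or gbm3 fit as a data frame, with the
# missing-value nodes dropped and the initial value added to the predictions.
get_tree <- function(g) {
  # gbm and gbm3 expose differently named tree printers
  t <- if (class(g) == "gbm") gbm::pretty.gbm.tree(g)
       else gbm3::pretty_gbm_tree(g)
  t$Node <- as.integer(rownames(t))
  # MissingNode is 0-based, so +1 gives row positions (0s from the -1 entries are ignored)
  mis <- t$MissingNode + 1
  t <- t[-mis,]
  t$MissingNode <- NULL
  # Prediction is relative to the initial fit; shift it back to the response scale
  t$RealPrediction <- t$Prediction + g$initF
  t
}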

X <- readRDS('X.rds')
str(X)

params <- list(y ~ ., data=X, distribution = list(name="quantile", alpha=0.9), n.trees = 1,
               interaction.depth = 2, n.minobsinnode = 250, shrinkage = 1, bag.fraction = 1)
g0 <- do.call(gbm::gbm, params)
g3 <- do.call(gbm3::gbm, params)

get_tree(g0)
get_tree(g3)

# overall 90% quantile:
quantile(X$y, 0.9, type = 2)
# true 90% quantiles inside the splits:
X[, quantile(y, 0.9, type = 2), .(s1 = a<14.5, s2 = a>=14.5 & b < 0.133)]

# Check the empirical CDF's in the 4th node for g0 and g3:
X[a>=14.5 & b >= 0.133, ecdf(y)(c(1.716003, 1.529294))]

The output is:

> get_tree(g0)
  SplitVar SplitCodePred LeftNode RightNode ErrorReduction Weight  Prediction Node RealPrediction
0        0    14.5000000        1         2      0.6721752   1567  0.03466136    0       1.984051
1       -1     0.1333954       -1        -1      0.0000000    854  0.13339536    1       2.082785
2        1     0.5000000        3         4      1.4643795    713 -0.08359787    2       1.865792
3       -1     0.1578200       -1        -1      0.0000000    273  0.15781996    3       2.107210
4       -1    -0.2333867       -1        -1      0.0000000    440 -0.23338666    4       1.716003
> get_tree(g3)
  SplitVar SplitCodePred LeftNode RightNode ErrorReduction Weight   Prediction Node RealPrediction
0        0    14.5000000        1         2      0.6721752   1567  0.009314278    0       1.958704
1       -1     0.1333954       -1        -1      0.0000000    854  0.133395364    1       2.082785
2        1     0.5000000        3         4      1.4643795    713 -0.112242207    2       1.837148
3       -1     0.1578200       -1        -1      0.0000000    273  0.157819963    3       2.107210
4       -1    -0.4200960       -1        -1      0.0000000    440 -0.420095993    4       1.529294
> 
> # overall 90% quantile:
> quantile(X$y, 0.9, type = 2)
    90% 
1.94939 
> # true 90% quantiles inside the splits:
> X[, quantile(y, 0.9, type = 2), .(s1 = a<14.5, s2 = a>=14.5 & b < 0.133)]
      s1    s2       V1
1: FALSE FALSE 1.716003
2: FALSE  TRUE 2.107210
3:  TRUE FALSE 2.082785
> 
> # Check the empirical CDF's in the 4th node for g0 and g3:
> X[a>=14.5 & b >= 0.133, ecdf(y)(c(1.716003, 1.529294))]
[1] 0.8909091 0.5727273

The true 90% quantiles within each split match the g0 leaves exactly. The node 4 leaf in g3, however, is wrong: its value of 1.529294 sits at about the 57% empirical quantile of that node rather than the 90%.
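
For anyone who wants to repeat this kind of leaf check without the attached data, here is a minimal base-R sketch. It assumes synthetic normal data standing in for the responses of one terminal node. `check_leaf_quantile` and `pinball_loss` are hypothetical helpers written for this illustration only and are not part of gbm or gbm3. It shows two independent checks: where a leaf value sits in the node's empirical CDF, and whether it gives a lower quantile (pinball) loss than an alternative value.

```r
# Synthetic stand-in for the responses in one terminal node
# (assumption: NOT the X.rds data attached to this issue).
set.seed(1)
y_leaf <- rnorm(440)
alpha  <- 0.9

# Hypothetical helper: where does a leaf prediction sit in the node's ECDF?
check_leaf_quantile <- function(y, pred, alpha, tol = 1.5 / length(y)) {
  level <- ecdf(y)(pred)   # fraction of y values <= pred
  list(level = level, ok = abs(level - alpha) <= tol)
}

# Hypothetical helper: average quantile (pinball) loss of a constant prediction.
pinball_loss <- function(y, pred, alpha) {
  mean(ifelse(y >= pred, alpha * (y - pred), (1 - alpha) * (pred - y)))
}

# A correct leaf value (the type-2 90% quantile) versus a value that sits
# near the 57% level, mimicking the discrepancy reported above.
good <- quantile(y_leaf, alpha, type = 2, names = FALSE)
bad  <- quantile(y_leaf, 0.57,  type = 2, names = FALSE)

res_good <- check_leaf_quantile(y_leaf, good, alpha)
res_bad  <- check_leaf_quantile(y_leaf, bad,  alpha)

print(res_good$level)   # close to 0.9
print(res_bad$level)    # close to 0.57

# The correct quantile also gives the smaller pinball loss.
loss_good <- pinball_loss(y_leaf, good, alpha)
loss_bad  <- pinball_loss(y_leaf, bad,  alpha)
print(loss_good < loss_bad)
```

Applied to a real fit, `y` would be the responses falling into a given leaf and `pred` the leaf's `RealPrediction`. Note that values copied from printed output are rounded, so the ECDF level can land slightly below the target when the rounded value falls just under an observation.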