carolssnz / gradientboostedmodels

Automatically exported from code.google.com/p/gradientboostedmodels

inconsistent predictions when 0 predictors had non-zero influence #13

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
There are two issues:

1) setting the seed does not ensure reproducibility of the model (see the condensed check below)
2) when no predictors are used in any splits, the predicted values are inconsistent; sometimes NA values are produced
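In other words, this condensed version of the transcripts below should return TRUE but does not (verbose = FALSE is only there to suppress the iteration log):

library(gbm)
library(caret)
data(mdrr)

x <- mdrrDescr[, 1:20]
y <- ifelse(mdrrClass == "Active", 1, 0)

# Fit twice with the same seed; if seeding controlled all of the
# randomness, the two sets of predictions should be identical
set.seed(1)
p1 <- predict(gbm.fit(x, y, distribution = "bernoulli", verbose = FALSE),
              head(x), n.trees = 100, type = "response")
set.seed(1)
p2 <- predict(gbm.fit(x, y, distribution = "bernoulli", verbose = FALSE),
              head(x), n.trees = 100, type = "response")
identical(p1, p2)  # expected TRUE; the two runs below disagree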

> library(gbm)
> library(caret)
> data(mdrr)
> 
> set.seed(1)
> gbm1 <- gbm.fit(mdrrDescr[, 1:20], ifelse(mdrrClass == "Active", 1, 0),
+                 distribution = "bernoulli")
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1           inf             nan     0.0010       nan
     2           inf             nan     0.0010       nan
     3           inf             nan     0.0010       nan
     4           inf             nan     0.0010       nan
     5           inf             nan     0.0010       nan
     6           inf             nan     0.0010       nan
     7           inf             nan     0.0010       nan
     8           inf             nan     0.0010       nan
     9           inf             nan     0.0010       nan
    10           inf             nan     0.0010       nan
    20           inf             nan     0.0010       nan
    40           inf             nan     0.0010       nan
    60           inf             nan     0.0010       nan
    80           inf             nan     0.0010       nan
   100           inf             nan     0.0010       nan

> gbm1
NULL
A gradient boosted model with bernoulli loss function.
100 iterations were performed.
There were 20 predictors of which 0 had non-zero influence.
> 
> predict(gbm1, head(mdrrDescr), n.trees = 100, type = "response")
[1] 0.485376 0.485376 0.485376 0.485376 0.485376 0.485376
> set.seed(1)
> gbm1 <- gbm.fit(mdrrDescr[, 1:20], ifelse(mdrrClass == "Active", 1, 0),
+                 distribution = "bernoulli")
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1           nan             nan     0.0010       nan
     2           nan             nan     0.0010       nan
     3           nan             nan     0.0010       nan
     4           nan             nan     0.0010       nan
     5           nan             nan     0.0010       nan
     6           nan             nan     0.0010       nan
     7           nan             nan     0.0010       nan
     8           nan             nan     0.0010       nan
     9           nan             nan     0.0010       nan
    10           nan             nan     0.0010       nan
    20           nan             nan     0.0010       nan
    40           nan             nan     0.0010       nan
    60           nan             nan     0.0010       nan
    80           nan             nan     0.0010       nan
   100           nan             nan     0.0010       nan

> gbm1
NULL
A gradient boosted model with bernoulli loss function.
100 iterations were performed.
There were 20 predictors of which 0 had non-zero influence.
> 
> predict(gbm1, head(mdrrDescr), n.trees = 100, type = "response")
[1] NaN NaN NaN NaN NaN NaN

It looks like older versions would produce a non-NA value for all samples (as 
in the first run above).

I'm also not sure why no splits would occur in the model; this seems to be 
happening more frequently than before.
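Until that's sorted out, I'm guarding predictions with a check like this (my own workaround sketch; it assumes the relative influence from summary.gbm is all zero exactly when no splits were made):

ri <- summary(gbm1, plotit = FALSE)
if (all(ri$rel.inf == 0)) {
  # no predictor entered any split, so predict() can't be trusted here
  warning("model made no splits; predictions are just the prior rate")
}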

Thanks,

Max

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  splines   stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
[1] caret_5.15-60    reshape2_1.2.1   plyr_1.8        
[4] foreach_1.4.0    cluster_1.14.3   gbm_2.0-9.3     
[7] lattice_0.20-10  survival_2.36-14

loaded via a namespace (and not attached):
[1] codetools_0.2-8 grid_2.15.2     iterators_1.0.6 stringr_0.6.1  
[5] tools_2.15.2  

Original issue reported on code.google.com by MxK...@gmail.com on 6 Feb 2013 at 3:54

GoogleCodeExporter commented 9 years ago
One other note: this does not seem to occur when called using gbm()

> dat <- mdrrDescr[, 1:20]
> dat$y <- ifelse(mdrrClass == "Active", 1, 0)
> set.seed(1)
> gbm1 <- gbm(y ~ ., data = dat,
+             distribution = "bernoulli")
> gbm1
gbm(formula = y ~ ., distribution = "bernoulli", data = dat)
A gradient boosted model with bernoulli loss function.
100 iterations were performed.
There were 20 predictors of which 5 had non-zero influence.
> predict(gbm1, head(mdrrDescr), n.trees = 100, type = "response")
[1] 0.5404099 0.5619192 0.5301132 0.5255829 0.5619192 0.5371400
> 
> 
> count <- 0
> for(i in 1:100)
+ {  
+   gbm1 <- gbm(y ~ ., data = dat, distribution = "bernoulli")
+   prd <- predict(gbm1, head(mdrrDescr), n.trees = 100, type = "response")
+   if(any(is.na(prd))) count <- count + 1
+ }
> count
[1] 0
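For comparison, the analogous loop with gbm.fit() (a sketch along the same lines; verbose = FALSE only quiets the iteration log) should turn up a non-zero count if the problem is specific to gbm.fit():

count <- 0
for(i in 1:100)
{
  fit <- gbm.fit(mdrrDescr[, 1:20], ifelse(mdrrClass == "Active", 1, 0),
                 distribution = "bernoulli", verbose = FALSE)
  prd <- predict(fit, head(mdrrDescr), n.trees = 100, type = "response")
  if(any(is.na(prd))) count <- count + 1
}
count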

Original comment by MxK...@gmail.com on 6 Feb 2013 at 4:39

GoogleCodeExporter commented 9 years ago
I have also come across this problem.

Original comment by guoshich...@gmail.com on 20 Feb 2014 at 6:28

GoogleCodeExporter commented 9 years ago
It works when I replace bernoulli with gaussian in this call:

gbm1 <- gbm.fit(mdrrDescr[, 1:20], ifelse(mdrrClass == "Active", 1, 0),
                distribution = "gaussian")

Original comment by guoshich...@gmail.com on 20 Feb 2014 at 6:29

t-g-williams commented 5 years ago

I was having similar issues, and converting the response to a "logical" solved them. If the data frame is called d and the response is called y:

d$y <- as.logical(as.integer(d$y) - 1)

Then run the gbm function with this.
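A fuller, self-contained version of that suggestion (my sketch, not tested against this exact data; it assumes the response starts as a two-level factor such as mdrrClass):

library(gbm)
library(caret)
data(mdrr)

d <- mdrrDescr[, 1:20]
# a two-level factor codes as integers 1/2, so subtracting 1
# and coercing gives FALSE/TRUE
d$y <- as.logical(as.integer(mdrrClass) - 1)

gbm1 <- gbm(y ~ ., data = d, distribution = "bernoulli")
predict(gbm1, head(d), n.trees = 100, type = "response")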