boost-R / gamboostLSS

Boosting models for fitting generalized additive models for location, shape and scale (GAMLSS) to potentially high-dimensional data. The current release version can be found on CRAN (https://cran.r-project.org/package=gamboostLSS).

predict for (some) noncyclic models broken #48

Closed (hofnerb closed this issue 6 years ago)

hofnerb commented 6 years ago

Spotted by Lisa Schlosser:

library("gamboostLSS")
set.seed(7)
data <- data.frame(x1 = runif(400, -1, 1),
                   x2 = runif(400, -1, 1),
                   x3 = runif(400, -1, 1),
                   x4 = runif(400, -1, 1))
data$y <- rnorm(400, 4 * data$x1 + 2)

gb_oob <- gamboostLSS(y ~ x1, data = data,
                      control = boost_control(risk = "oobag"), method = "noncyclic")
## works:
predict(gb_oob, type = "response", parameter = c("mu", "sigma"))

## error:
predict(gb_oob, type = "response", parameter = c("mu", "sigma"), newdata = data)

## works but seems wrong as sigma has 5 elements instead of 4:
predict(gb_oob, type = "response", parameter = c("mu", "sigma"), newdata = data[c(1:4),])

This only happens with risk = "oobag" and method = "noncyclic".
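
A cross-check one can run (a sketch; per the observation above, the cyclic algorithm should be unaffected):

gb_cyc <- gamboostLSS(y ~ x1, data = data,
                      control = boost_control(risk = "oobag"),
                      method = "cyclic")
## per the observation above, this should predict without error:
predict(gb_cyc, type = "response", parameter = c("mu", "sigma"),
        newdata = data)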

hofnerb commented 6 years ago

The question is why the prediction is influenced by the type of risk at all. The problem seems to be related to how the risk is handled in the selection and evaluation of the base-learners.

set.seed(7)
data <- data.frame(x1 = runif(400, -1, 1),
                   x2 = runif(400, -1, 1),
                   x3 = runif(400, -1, 1),
                   x4 = runif(400, -1, 1))
data$y <- rnorm(400, 4 * data$x1 + 2)

gb_oob <- gamboostLSS(y ~ x1, data = data,
                      control = boost_control(risk = "oobag"), method = "noncyclic")
risk(gb_oob)
# $mu
# [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [68] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#
# $sigma
# [1] 0
selected(gb_oob)
# $mu
#   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#  [68] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# 
# $sigma
# NULL

gb_inbag <- gamboostLSS(y ~ x1, data = data,
                        control = boost_control(risk = "inbag"), method = "noncyclic")
risk(gb_inbag)
# $mu
#  [1] 949.9624 898.9582 895.6406 892.4078 889.2578 882.8732 879.7552 876.7193 873.7634 867.5493 864.5580 861.6487 855.5644 852.5942 849.7092
# [16] 843.6906 840.7283 837.8551 831.8645 828.9039 826.0369 820.0512 817.0910 814.2295 808.2349 805.2780 802.4249 796.4144 793.4665 790.6278
# [31] 784.5994 781.6687 778.8525 772.8086 769.9051 767.1214 761.0682 758.2035 755.4635 749.4106 746.5972 740.7360 737.8726 735.1477 729.2749
# [46] 726.5057 723.8772 718.0157 715.3498 709.6802 707.0003 704.4707 698.8211 696.2814 690.8209 688.2926 683.0020 680.5043 675.3718 672.9219
# [61] 667.9414 665.5541 663.3340 658.4068 656.2402 651.4995 649.4003 644.8441 642.8241 638.4542
# 
# $sigma
#  [1] 949.9624 934.2057 923.8818 916.3281 910.5219 905.9646 902.3630 886.0757 870.6250 858.6224 846.7449 834.9220 823.1134 811.2995 799.4758
# [16] 787.6494 775.8371 764.0634 752.3596 743.7451 732.1927 720.8324 712.5201 701.5200 693.5159 685.6723 677.9991 670.5096 660.7443 653.7715
# [31] 647.0375 640.5578

selected(gb_inbag)
# $mu
#  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [69] 1
# 
# $sigma
#  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

gb_none <- gamboostLSS(y ~ x1, data = data,
                       control = boost_control(risk = "none"), method = "noncyclic")
risk(gb_none)
# $mu
#  [1] 949.9624 898.9582 895.6406 892.4078 889.2578 882.8732 879.7552 876.7193 873.7634 867.5493 864.5580 861.6487 855.5644 852.5942 849.7092
# [16] 843.6906 840.7283 837.8551 831.8645 828.9039 826.0369 820.0512 817.0910 814.2295 808.2349 805.2780 802.4249 796.4144 793.4665 790.6278
# [31] 784.5994 781.6687 778.8525 772.8086 769.9051 767.1214 761.0682 758.2035 755.4635 749.4106 746.5972 740.7360 737.8726 735.1477 729.2749
# [46] 726.5057 723.8772 718.0157 715.3498 709.6802 707.0003 704.4707 698.8211 696.2814 690.8209 688.2926 683.0020 680.5043 675.3718 672.9219
# [61] 667.9414 665.5541 663.3340 658.4068 656.2402 651.4995 649.4003 644.8441 642.8241 638.4542
# 
# $sigma
#  [1] 949.9624 934.2057 923.8818 916.3281 910.5219 905.9646 902.3630 886.0757 870.6250 858.6224 846.7449 834.9220 823.1134 811.2995 799.4758
# [16] 787.6494 775.8371 764.0634 752.3596 743.7451 732.1927 720.8324 712.5201 701.5200 693.5159 685.6723 677.9991 670.5096 660.7443 653.7715
# [31] 647.0375 640.5578

selected(gb_none)
# $mu
#  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [69] 1
# 
# $sigma
#  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Obviously, since we have no out-of-bag data here, risk = "oobag" produces degenerate results; "inbag" and "none" behave identically.
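
A quick consistency check (sketch) of that last claim:

all.equal(risk(gb_inbag), risk(gb_none))
# expected: TRUE, per the printed risk paths above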

The question is how we should treat the risk attribute in the selection of base-learners. Do we always use the inbag risk, or do we respect the specified risk type? In the latter case, "none" should throw an error (?) and "oobag" should only work if there are out-of-bag observations (i.e., observations with weight = 0). I would tend to ignore the risk type when selecting the best-fitting component, as we also use only the inbag data to compute the RSS.
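
For illustration, a sketch of how risk = "oobag" is meant to be used, with genuine out-of-bag observations (weight = 0); the 300/100 split is an arbitrary choice:

set.seed(7)
w <- sample(rep(c(1, 0), times = c(300, 100)))  # 300 inbag, 100 out-of-bag rows
gb_oob_w <- gamboostLSS(y ~ x1, data = data, weights = w,
                        control = boost_control(risk = "oobag"),
                        method = "noncyclic")
## the out-of-bag risk path should no longer be all zero:
head(risk(gb_oob_w)$mu)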

hofnerb commented 6 years ago

The second, related problem could be that we want to predict for a model component that was never selected. There seems to be a bug there as well.
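
To connect the two problems, a sketch (assuming mboostLSS objects are plain lists of the per-parameter mboost fits): sigma was never selected above, so its sub-model is effectively empty, and predicting it in isolation should reproduce the failure:

selected(gb_oob)$sigma                 # NULL, see above
predict(gb_oob$sigma, newdata = data)  # the never-updated component alone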

ja-thomas commented 6 years ago

Thanks for reporting!

So, mboost's predict does something strange when predicting with an empty model (mstop = 0):

library("mboost")
m <- glmboost(speed ~ dist, data = cars, control = boost_control(mstop = 0))
predict(m, type = "response", newdata = cars)
# Error in names(pr) <- nm : 
#   'names' attribute [50] must be the same length as the vector [2]

EDIT: It seems that one element per base-learner is predicted when mstop = 0. That explains why the vector has length 5 in the first example and length 2 in this one.
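
For comparison, a sketch of the same call with a non-trivial mstop, which should behave as expected:

m1 <- glmboost(speed ~ dist, data = cars, control = boost_control(mstop = 10))
length(predict(m1, type = "response", newdata = cars))
# expected: 50 (one prediction per row of cars)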

hofnerb commented 6 years ago

OK, so (part of) the error seems to be hidden in mboost. I'll open an issue there as well.

However, I do not understand the EDIT. What do you mean by "one element per base-learner"? We have exactly one base-learner in each of the two models above (the only difference being that one is a linear and one a smooth base-learner).

hofnerb commented 6 years ago

@ja-thomas, as you already seem to understand the error and as I do not follow your comment, could you please provide a patch for this bug? That would be really great!

hofnerb commented 6 years ago

This was actually a bug in mboost, not in gamboostLSS. It is fixed now.