brian-j-smith / MachineShop

MachineShop: R package of models and tools for machine learning
https://brian-j-smith.github.io/MachineShop/

Documentation about outer loop in nested resampling #7

Open · lang-benjamin opened this issue 1 year ago

lang-benjamin commented 1 year ago

Maybe I missed it, but I could not find documentation about how the outer loop is performed when nested resampling is done. I assume the inner loop is defined via one of the algorithms in section 9.1 of the user guide, e.g. via BootOptimismControl() or CVControl(). How can the outer loop be controlled or looked up?

brian-j-smith commented 1 year ago

Hi @lang-benjamin,

The resampling algorithms are defined by the control arguments of the functions that have them (e.g. resample() and ModelSpecification()). For instance, in the code below, cross-validation is used with 5 folds for model tuning in the inner resampling loop and with 10 folds for estimation of model performance in the outer loop. All of these functions share the same default control argument value of MachineShop::settings("control"), which can be changed globally. Otherwise, the control arguments can be set individually, as below, to use different algorithms at different nesting levels. See ?MachineShop::controls in R for the full package documentation on the control functions.

library(MachineShop)
data(ICHomes)

## Specify inputs and model
modelspec <- ModelSpecification(
  sale_amount ~ .,
  data = ICHomes,
  model = TunedModel(GBMModel),
  control = CVControl(folds = 5)
)

## See model specification details, including the inner resampling algorithm
print(modelspec)

## Estimate model performance
res <- resample(
  modelspec,
  control = CVControl(folds = 10)
)

## See performance details, including the outer resampling algorithm
print(res)

## Performance summary statistics
summary(res)
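
If you want to change the default globally instead of per call, something along the following lines should work. This is only a sketch based on ?MachineShop::settings: the "control" setting is documented as a default resampling method that can be reassigned, and I am assuming reset restores the packaged default, so check the help page before relying on the exact calls.

## Set the package-wide default resampling control; functions whose
## control argument defaults to MachineShop::settings("control") will use it
settings(control = CVControl(folds = 10))

## View the current default
settings("control")

## Restore the packaged default
settings(reset = "control")
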
lang-benjamin commented 1 year ago

Ah, I see! Thanks for the explanation (and for the great package).

lang-benjamin commented 1 year ago

One more question on this topic. Using the code above, print(modelspec) will show information on the resampling algorithm. However, when I use XGBModel instead of GBMModel, i.e. model = TunedModel(XGBModel), the info about the resampling algorithm does not show up:

library(MachineShop, warn.conflicts = FALSE)
data(ICHomes)

##  Specify inputs and model
modelspec <- ModelSpecification(
  sale_amount ~ .,
  data = ICHomes,
  model = TunedModel(XGBModel),
  control = CVControl(folds = 5)
)

## See model specification details, including the inner resampling algorithm
print(modelspec)
#> --- ModelSpecification object --------------------------------------------------
#> 
#> === ModelFrame object ===
#> 
#> Terms: sale_amount ~ sale_year + sale_month + built + style + construction +
#>   base_size + add_size + garage1_size + garage2_size + lot_size + bedrooms +
#>   basement + ac + attic + lon + lat
#> Number of observations: 753
#> 
#> === MLModel object ===
#> 
#> Model name: XGBModel
#> Label: Extreme Gradient Boosting

Created on 2023-04-28 with reprex v2.0.2

Using model = TunedModel(XGBModel, grid = TuningGrid(size = 100, random = 10)) also does not work, but using a fixed grid works:

library(MachineShop, warn.conflicts = FALSE)
data(ICHomes)

grid <- expand_params(nrounds = c(25, 50),
                      max_depth = c(1, 2),
                      eta = c(0.01, 0.02),
                      subsample = c(0.2, 0.4))

##  Specify inputs and model
modelspec <- ModelSpecification(
  sale_amount ~ .,
  data = ICHomes,
  model = TunedModel(XGBModel, grid = grid),
  control = CVControl(folds = 5)
)

## See model specification details, including the inner resampling algorithm
print(modelspec)
#> --- ModelSpecification object --------------------------------------------------
#> 
#> === ModelFrame object ===
#> 
#> ID: input.Qq98
#> Terms: sale_amount ~ sale_year + sale_month + built + style + construction +
#>   base_size + add_size + garage1_size + garage2_size + lot_size + bedrooms +
#>   basement + ac + attic + lon + lat
#> Number of observations: 753
#> 
#> === MLModel object ===
#> 
#> Model name: XGBModel
#> Label: Extreme Gradient Boosting
#> ID: model.2J8G
#> 
#> === Grid ===
#> 
#> # A tibble: 16 × 1
#>    model.2J8G$nrounds $max_depth  $eta $subsample
#>                 <dbl>      <dbl> <dbl>      <dbl>
#>  1                 25          1  0.01        0.2
#>  2                 25          1  0.01        0.4
#>  3                 25          1  0.02        0.2
#>  4                 25          1  0.02        0.4
#>  5                 25          2  0.01        0.2
#>  6                 25          2  0.01        0.4
#>  7                 25          2  0.02        0.2
#>  8                 25          2  0.02        0.4
#>  9                 50          1  0.01        0.2
#> 10                 50          1  0.01        0.4
#> # … with 6 more rows
#> 
#> === TrainingParams object ===
#> 
#> ... GridSearch object
#> Label: Grid Search
#> 
#> ... CVControl object
#> Label: K-Fold Cross-Validation
#> Folds: 5
#> Repeats: 1

Created on 2023-04-28 with reprex v2.0.2

Am I doing something wrong?

brian-j-smith commented 1 year ago

The print() method in MachineShop accepts a level argument that controls the amount of information displayed: negative values show more, positive values less. Because the information can grow quickly with model complexity, level lets you expand or contract how much of it is printed.

library(MachineShop, warn.conflicts = FALSE)
data(ICHomes)

grid <- expand_params(
  nrounds = c(25, 50),
  max_depth = c(1, 2),
  eta = c(0.01, 0.02),
  subsample = c(0.2, 0.4)
)

modelspec <- ModelSpecification(
  sale_amount ~ .,
  data = ICHomes,
  model = TunedModel(XGBModel, grid = grid),
  control = CVControl(folds = 5)
)

?MachineShop::print

## See increasingly more model details
print(modelspec, level = -1)
print(modelspec, level = -2)

lang-benjamin commented 1 year ago

Thanks! But even with level = -10, I do not see more information on the resampling algorithm when XGBModel is used. Also, the following model specification should take quite some time to run, yet it completes in less than one second. I have the impression that the tuning and resampling arguments are somehow being ignored?

library(MachineShop, warn.conflicts = FALSE)
data(ICHomes)

##  Specify inputs and model
modelspec <- ModelSpecification(
  sale_amount ~ .,
  data = ICHomes,
  model = TunedModel(XGBModel, grid = TuningGrid(size = 500, random = 100)), 
  control = BootOptimismControl(samples = 100)
)

## See model specification details, including the inner resampling algorithm
print(modelspec, level = -10)
#> --- ModelSpecification object --------------------------------------------------
#> 
#> === ModelFrame object ==========================================================
#> 
#> Terms: sale_amount ~ sale_year + sale_month + built + style + construction +
#>   base_size + add_size + garage1_size + garage2_size + lot_size + bedrooms +
#>   basement + ac + attic + lon + lat
#> Data:
#>    sale_amount sale_year sale_month built style      construction base_size
#> 1        90000      2005          1  2001 Condo     1 Story Frame       878
#> 2       168500      2005          1  1976  Home Split Foyer Frame      1236
#> 3       205000      2005          1  1995  Home Split Foyer Frame      1466
#> 4       121000      2005          1  2001 Condo     1 Story Condo      1150
#> 5       215000      2005          1  1974  Home     2 Story Frame       936
#> 6       278000      2005          2  1991  Home     2 Story Frame       936
#> 7       170000      2005          2  1977  Home Split Foyer Frame      1220
#> 8       290000      2005          2  1920  Home     2 Story Frame       985
#> 9       185000      2005          2  1993  Home     2 Story Frame       914
#> 10      109900      2005          2  1955  Home     1 Story Frame       864
#>    add_size garage1_size garage2_size
#> 1         0            0          264
#> 2         0          576            0
#> 3         0            0            0
#> 4         0            0          528
#> 5       376          572            0
#> 6       384          528            0
#> 7         0            0            0
#> 8       356            0            0
#> 9         0          440            0
#> 10        0          240            0
#> ... with 743 more rows and 8 more columns: lot_size, bedrooms, basement, ac,
#> attic, lon, lat, (strata)
#> 
#> === MLModel object =============================================================
#> 
#> Model name: XGBModel
#> Label: Extreme Gradient Boosting
#> Package: xgboost (>= 1.3.0)
#> Response types: factor, numeric, PoissonVariate, Surv
#> Case weights support: TRUE
#> Missing case removal: response
#> Tuning grid: FALSE
#> Variable importance: TRUE
#> 
#> Parameters:
#> List of 6
#>  $ nrounds                    : num 100
#>  $ aft_loss_distribution      : chr "normal"
#>  $ aft_loss_distribution_scale: num 1
#>  $ base_score                 : num 0.5
#>  $ verbose                    : num 0
#>  $ print_every_n              : num 1

system.time({
  res <- resample(
    modelspec,
    control = CVControl(folds = 3)
  )
})
#>    user  system elapsed 
#>   0.842   0.022   0.864

Created on 2023-04-28 with reprex v2.0.2

brian-j-smith commented 1 year ago

TuningGrid() is for models that have an automated tuning grid, which XGBModel does not (note Tuning grid: FALSE in the printout above). Either a manual grid needs to be supplied, as in your previous example, or one of XGBTreeModel (recommended), XGBLinearModel, or XGBDARTModel (very slow) should be used. All three of those submodels have automated grids.
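
For example, switching to XGBTreeModel should allow TuningGrid() to sample from the automated grid. A minimal sketch along the lines of the earlier reprexes (not run here):

library(MachineShop, warn.conflicts = FALSE)
data(ICHomes)

## XGBTreeModel supplies an automated tuning grid, so TuningGrid() can
## sample from it (a grid of size 100 with 10 points sampled at random)
modelspec <- ModelSpecification(
  sale_amount ~ .,
  data = ICHomes,
  model = TunedModel(XGBTreeModel, grid = TuningGrid(size = 100, random = 10)),
  control = CVControl(folds = 5)
)

## The Grid and TrainingParams sections should now appear in the printout
print(modelspec)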