lang-benjamin opened this issue 1 year ago
Hi @langb,

The resampling algorithms are defined by the control arguments of the functions that accept them (e.g. resample() and ModelSpecification()). For instance, in the code below, cross-validation is used with 5 folds for model tuning in the inner resampling loop and with 10 folds for estimation of model performance in the outer loop. All of the functions have the same default control argument value of MachineShop::settings("control"), which can be changed globally. Otherwise, the control arguments can be set individually, as below, to apply different algorithms at different nesting levels. See ?MachineShop::controls in R for the full package documentation on the control functions.
## Specify inputs and model
modelspec <- ModelSpecification(
  sale_amount ~ .,
  data = ICHomes,
  model = TunedModel(GBMModel),
  control = CVControl(folds = 5)
)
## See model specification details, including the inner resampling algorithm
print(modelspec)
## Estimate model performance
res <- resample(
  modelspec,
  control = CVControl(folds = 10)
)
## See performance details, including the outer resampling algorithm
print(res)
## Performance summary statistics
summary(res)
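As an aside, the global default mentioned above can also be changed through settings(). A minimal sketch, assuming settings() accepts a named control value the same way it returns one via settings("control") (check ?MachineShop::settings for the exact accepted value types):

```r
library(MachineShop, warn.conflicts = FALSE)

## Inspect the current global default resampling control
settings("control")

## Assumption: a control object can be assigned as the new global default,
## which resample(), ModelSpecification(), etc. then pick up automatically
settings(control = CVControl(folds = 10, seed = 123))
```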
Ah, I see! Thanks for the explanation (and for the great package).

One more question on this topic. Using the code above, print(modelspec) will show information on the resampling algorithm. However, when I use XGBModel instead of GBMModel, i.e. model = TunedModel(XGBModel), the info about the resampling algorithm does not show up:
library(MachineShop, warn.conflicts = FALSE)
data(ICHomes)
## Specify inputs and model
modelspec <- ModelSpecification(
  sale_amount ~ .,
  data = ICHomes,
  model = TunedModel(XGBModel),
  control = CVControl(folds = 5)
)
## See model specification details, including the inner resampling algorithm
print(modelspec)
#> --- ModelSpecification object --------------------------------------------------
#>
#> === ModelFrame object ===
#>
#> Terms: sale_amount ~ sale_year + sale_month + built + style + construction +
#> base_size + add_size + garage1_size + garage2_size + lot_size + bedrooms +
#> basement + ac + attic + lon + lat
#> Number of observations: 753
#>
#> === MLModel object ===
#>
#> Model name: XGBModel
#> Label: Extreme Gradient Boosting
Created on 2023-04-28 with reprex v2.0.2
Using model = TunedModel(XGBModel, grid = TuningGrid(size = 100, random = 10)) also does not work, but using a fixed grid does:
library(MachineShop, warn.conflicts = FALSE)
data(ICHomes)
grid <- expand_params(
  nrounds = c(25, 50),
  max_depth = c(1, 2),
  eta = c(0.01, 0.02),
  subsample = c(0.2, 0.4)
)
## Specify inputs and model
modelspec <- ModelSpecification(
  sale_amount ~ .,
  data = ICHomes,
  model = TunedModel(XGBModel, grid = grid),
  control = CVControl(folds = 5)
)
## See model specification details, including the inner resampling algorithm
print(modelspec)
#> --- ModelSpecification object --------------------------------------------------
#>
#> === ModelFrame object ===
#>
#> ID: input.Qq98
#> Terms: sale_amount ~ sale_year + sale_month + built + style + construction +
#> base_size + add_size + garage1_size + garage2_size + lot_size + bedrooms +
#> basement + ac + attic + lon + lat
#> Number of observations: 753
#>
#> === MLModel object ===
#>
#> Model name: XGBModel
#> Label: Extreme Gradient Boosting
#> ID: model.2J8G
#>
#> === Grid ===
#>
#> # A tibble: 16 × 1
#> model.2J8G$nrounds $max_depth $eta $subsample
#> <dbl> <dbl> <dbl> <dbl>
#> 1 25 1 0.01 0.2
#> 2 25 1 0.01 0.4
#> 3 25 1 0.02 0.2
#> 4 25 1 0.02 0.4
#> 5 25 2 0.01 0.2
#> 6 25 2 0.01 0.4
#> 7 25 2 0.02 0.2
#> 8 25 2 0.02 0.4
#> 9 50 1 0.01 0.2
#> 10 50 1 0.01 0.4
#> # … with 6 more rows
#>
#> === TrainingParams object ===
#>
#> ... GridSearch object
#> Label: Grid Search
#>
#> ... CVControl object
#> Label: K-Fold Cross-Validation
#> Folds: 5
#> Repeats: 1
Created on 2023-04-28 with reprex v2.0.2
Am I doing something wrong?
The print() method in MachineShop accepts a level argument that controls the amount of information displayed, with negative values showing more and positive values less. The information can grow quickly with model complexity, so the level argument is there to expand or contract the amount displayed.
library(MachineShop, warn.conflicts = FALSE)
data(ICHomes)
grid <- expand_params(
  nrounds = c(25, 50),
  max_depth = c(1, 2),
  eta = c(0.01, 0.02),
  subsample = c(0.2, 0.4)
)
modelspec <- ModelSpecification(
  sale_amount ~ .,
  data = ICHomes,
  model = TunedModel(XGBModel, grid = grid),
  control = CVControl(folds = 5)
)
?MachineShop::print
## See increasingly more model details
print(modelspec, level = -1)
print(modelspec, level = -2)
Thanks! But even if I set level = -10, I do not see more info on the resampling algorithm when XGBModel is used. Also, the following model specification should take quite some time to run, yet it completes in less than one second. I have the impression that the tuning and resampling arguments are somehow being ignored?
library(MachineShop, warn.conflicts = FALSE)
data(ICHomes)
## Specify inputs and model
modelspec <- ModelSpecification(
  sale_amount ~ .,
  data = ICHomes,
  model = TunedModel(XGBModel, grid = TuningGrid(size = 500, random = 100)),
  control = BootOptimismControl(samples = 100)
)
## See model specification details, including the inner resampling algorithm
print(modelspec, level = -10)
#> --- ModelSpecification object --------------------------------------------------
#>
#> === ModelFrame object ==========================================================
#>
#> Terms: sale_amount ~ sale_year + sale_month + built + style + construction +
#> base_size + add_size + garage1_size + garage2_size + lot_size + bedrooms +
#> basement + ac + attic + lon + lat
#> Data:
#> sale_amount sale_year sale_month built style construction base_size
#> 1 90000 2005 1 2001 Condo 1 Story Frame 878
#> 2 168500 2005 1 1976 Home Split Foyer Frame 1236
#> 3 205000 2005 1 1995 Home Split Foyer Frame 1466
#> 4 121000 2005 1 2001 Condo 1 Story Condo 1150
#> 5 215000 2005 1 1974 Home 2 Story Frame 936
#> 6 278000 2005 2 1991 Home 2 Story Frame 936
#> 7 170000 2005 2 1977 Home Split Foyer Frame 1220
#> 8 290000 2005 2 1920 Home 2 Story Frame 985
#> 9 185000 2005 2 1993 Home 2 Story Frame 914
#> 10 109900 2005 2 1955 Home 1 Story Frame 864
#> add_size garage1_size garage2_size
#> 1 0 0 264
#> 2 0 576 0
#> 3 0 0 0
#> 4 0 0 528
#> 5 376 572 0
#> 6 384 528 0
#> 7 0 0 0
#> 8 356 0 0
#> 9 0 440 0
#> 10 0 240 0
#> ... with 743 more rows and 8 more columns:lot_size, bedrooms, basement, ac,
#> attic, lon, lat, (strata)
#>
#> === MLModel object =============================================================
#>
#> Model name: XGBModel
#> Label: Extreme Gradient Boosting
#> Package: xgboost (>= 1.3.0)
#> Response types: factor, numeric, PoissonVariate, Surv
#> Case weights support: TRUE
#> Missing case removal: response
#> Tuning grid: FALSE
#> Variable importance: TRUE
#>
#> Parameters:
#> List of 6
#> $ nrounds : num 100
#> $ aft_loss_distribution : chr "normal"
#> $ aft_loss_distribution_scale: num 1
#> $ base_score : num 0.5
#> $ verbose : num 0
#> $ print_every_n : num 1
system.time({
  res <- resample(
    modelspec,
    control = CVControl(folds = 3)
  )
})
#>    user  system elapsed
#>   0.842   0.022   0.864
Created on 2023-04-28 with reprex v2.0.2
TuningGrid() is for models that have an automated tuning grid, which XGBModel does not (Tuning grid: FALSE in the printout above). Either a manual grid needs to be supplied, as in your previous example, or one of XGBTreeModel (recommended), XGBLinearModel, or XGBDARTModel (very slow) should be used. Those three submodels all have automated grids.
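Following that recommendation, the earlier TuningGrid() example could be rewritten against XGBTreeModel. A sketch, assuming its automated grid works with TuningGrid() as described:

```r
library(MachineShop, warn.conflicts = FALSE)
data(ICHomes)

## XGBTreeModel has an automated tuning grid, so TuningGrid() can
## request 100 grid points with 10 sampled at random from them
modelspec <- ModelSpecification(
  sale_amount ~ .,
  data = ICHomes,
  model = TunedModel(XGBTreeModel, grid = TuningGrid(size = 100, random = 10)),
  control = CVControl(folds = 5)
)

## The inner resampling algorithm should now appear in the printout
print(modelspec)
```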
Maybe I missed it, but I could not find documentation about how the outer loop is performed when nested resampling is done. I assume the inner loop is defined via one of the algorithms in section 9.1 of the user guide, e.g. via BootOptimismControl() or CVControl(). How can the outer loop be controlled or looked up?
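Based on the earlier reply in this thread, the outer loop is set by the control argument passed to resample(), while the inner loop comes from the control argument of ModelSpecification(). A sketch combining two different algorithms at the two nesting levels, assuming the behavior described above:

```r
library(MachineShop, warn.conflicts = FALSE)
data(ICHomes)

## Inner loop: optimism-corrected bootstrap used for model tuning
modelspec <- ModelSpecification(
  sale_amount ~ .,
  data = ICHomes,
  model = TunedModel(GBMModel),
  control = BootOptimismControl(samples = 25)
)

## Outer loop: 10-fold cross-validation for performance estimation
res <- resample(modelspec, control = CVControl(folds = 10))

## The printout reports which resampling algorithm was used in the outer loop
print(res)
```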