fstermann / mlr-mini


Resampling #10

Closed m-muecke closed 1 year ago

m-muecke commented 1 year ago


Description

As described above, one would usually not want to measure the performance of an Inducer by looking at predictions made on data that is also present in the training set. Instead, we want to (repeatedly) split the data into a training and a validation set, train on one of these and predict on the other. We represent the abstract way of splitting data as a Split object, and the concrete way in which a given dataset is split as a SplitInstance.
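To make the distinction concrete, here is a minimal sketch of what a constructor like SplitCV could look like. The assumption that a Dataset keeps its data.frame in $data, and all internals shown here, are illustrative rather than the actual implementation:

# Sketch: SplitCV configures a splitting strategy; the returned function,
# applied to a dataset, produces a list-based SplitInstance.
SplitCV <- function(folds = 5L) {
  instantiate <- function(data) {
    n <- nrow(data$data)                        # assumed: Dataset stores its data.frame in $data
    fold <- sample(rep_len(seq_len(folds), n))  # random fold assignment, one per row
    splits <- lapply(seq_len(folds), function(i) {
      list(training = which(fold != i), validation = which(fold == i))
    })
    structure(splits, folds = folds, dataset = data,
              class = c("SplitInstanceCV", "SplitInstance"))
  }
  structure(instantiate, folds = folds, class = c("SplitCV", "Split"))
}

The intended usage then looks like this: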

identical(splt$cv, SplitCV)
#> [1] TRUE

cv5 <- splt$cv(folds = 5)

class(cv5)
#> [1] "SplitCV" "Split"

cars.split <- cv5(cars.data)

cars.split
#> CV Split Instance of the "cars" dataset (50 rows)
#> Configuration: folds = 5

class(cars.split)
#> [1] "SplitInstanceCV" "SplitInstance"

length(cars.split)
#> [1] 5

cars.split[[1]]
#> $training
#>  [1] 38 39 32 30 18  3 28 29 14 48 10 25 41  5 27 22 17 24 36 31  6 45 33 15 23
#> [26] 35 43  7 34  2 20 40 16 11 50 26 13  4 49 19
#> 
#> $validation
#>  [1] 12 21  9  8 37 44  1 47 42 46

Note that cars.split could be a list with a class and attributes, where [[ works as usual, but it could also be an S3 class that implements [[.SplitInstanceCV. The latter would save some memory: cars.split implemented as a plain list needs to contain 250 numbers (5 $training of length 40, 5 $validation of length 10), while the same information could be stored with 100 numbers or even fewer (50 shuffled row indices, and 50 numbers indicating the CV fold 1, 2, 3, 4, or 5).
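As a rough illustration of that leaner representation (the helper name and all internals are assumptions, not the package's actual code), one could store only the fold assignment and reconstruct the index sets on access:

# Sketch: a SplitInstanceCV that stores one fold id per row (50 numbers total)
# and computes the training/validation indices when [[ is used.
make_cv_instance <- function(n, folds) {   # hypothetical helper
  structure(list(fold = sample(rep_len(seq_len(folds), n))),
            folds = folds,
            class = c("SplitInstanceCV", "SplitInstance"))
}

`[[.SplitInstanceCV` <- function(x, i) {
  fold <- unclass(x)$fold                  # unclass() avoids dispatching back into this method
  list(training = which(fold != i), validation = which(fold == i))
}

length.SplitInstanceCV <- function(x) attr(x, "folds")

Code using cars.split[[1]]$training and length(cars.split) would behave the same with either representation; only the storage differs.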

The Split object can now be used to evaluate the performance of an Inducer on a Dataset.

We implement the resample() method that creates a ResamplePrediction. An Evaluator can then be applied to this object.

rp <- resample(cars.data, xgb, cv5)
## alternatively:
# rp <- resample(cars.data, xgb, cars.split)

mae(rp)
#> [1] 3.9
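For completeness, here is a rough sketch of how a resample() along these lines could be structured; fit(), predict(), and row-wise subsetting of a Dataset are placeholders for whatever the package actually provides, not confirmed API:

# Sketch only: train on each training fold, predict on the corresponding
# validation fold, and collect everything in a ResamplePrediction.
resample <- function(data, inducer, split) {
  # accept either a Split (instantiated here) or an existing SplitInstance
  instance <- if (inherits(split, "Split")) split(data) else split
  preds <- lapply(seq_along(instance), function(i) {
    idx <- instance[[i]]
    model <- fit(inducer, data[idx$training, ])   # placeholder training call
    predict(model, data[idx$validation, ])        # placeholder prediction call
  })
  structure(list(predictions = preds, data = data, inducer = inducer),
            class = "ResamplePrediction")
}

An Evaluator such as mae() would then compute its metric over the collected out-of-sample predictions, for example by averaging across folds.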