JaakobKind opened 4 years ago
Is there any "de facto" function for doing CV in R?
I'm asking because if there is, we can provide tight integration with it and won't have to maintain one ourselves. I'm sure a de facto implementation can do a better job than XGBoost.
I'm not aware of a lightweight library that is the "de facto" standard for cross validation in R, equivalent to GridSearchCV from scikit-learn in Python.
The closest might be the CV options available from {caret} (like in this blog post and this other blog post), but there isn't as clear a winner in the R world as scikit-learn is in Python.
In LightGBM we ended up implementing our own (https://github.com/microsoft/LightGBM/blob/1f3e72c43ca8485eeba988738ecb0e977c7977f1/R-package/R/lgb.cv.R).
This one is part of tidymodels, a large community of modeling packages in R: https://github.com/tidymodels/tune
In order to get fold indices, I have recently put a lightweight R package without any dependencies on CRAN for this purpose: splitTools. It supports stratified splitting (as used below), has a flexible output interface, and is optimized for speed. Maybe it would be an alternative to heavier dependencies such as caret.
In the situation of the OP:
library(splitTools)

targets_train <- c(0, 0, 0, 0, 1, 1, 1, 1)
folds <- create_folds(targets_train, k = 2, type = "stratified", seed = 1)
for (fold in folds) {
  cat("\n\nIndices:", fold)
  cat("\nValues:", targets_train[fold])
}
# Results
Indices: 2 3 6 7
Values: 0 0 1 1
Indices: 1 4 5 8
Values: 0 0 1 1
A technical problem: with stratified splitting, regression labels should be treated differently from classification labels. For regression labels, quantile binning is usually applied, while for classification no preprocessing is required. Since XGBoost's response is numeric for both regression and classification tasks, this can lead to ambiguous situations. The logic in splitTools is to first check the number of distinct values in the response: if it is larger than n_bins = 10, quantile binning is applied; otherwise no binning is done. The technical problem could be solved by using the num_classes info from XGBoost, e.g. setting n_bins = max(10, num_classes).
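The heuristic described above can be sketched in a few lines of base R. This is a hypothetical illustration of the idea, not splitTools' actual source; the helper name stratum_labels is my own:

```r
# Sketch of the heuristic: bin the response for stratification only
# when it has more distinct values than n_bins.
stratum_labels <- function(y, n_bins = 10) {
  if (length(unique(y)) > n_bins) {
    # Regression-like response: quantile binning
    breaks <- unique(quantile(y, probs = seq(0, 1, length.out = n_bins + 1)))
    cut(y, breaks = breaks, include.lowest = TRUE)
  } else {
    # Classification-like response: use the values as-is
    factor(y)
  }
}

# A 0/1 label is left untouched (2 strata) ...
length(levels(stratum_labels(c(0, 0, 1, 1))))
# ... while a continuous response is reduced to at most n_bins strata.
length(levels(stratum_labels(rnorm(1000))))
```

A binary label therefore yields exactly 2 strata, while 1000 distinct continuous values are collapsed into 10 quantile bins.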
Next weekend, I will look into how XGB currently does the splitting. It is probably just a small fix.
I had a look at the R code in "utils.R". Small samples and stratification are generally a bad combination: if one fixes one problem, the next will pop up. For not too small n, the imbalance is not a problem:
n <- 100
y <- rep(0:1, each = n)
folds <- xgb.createFolds(y, 3)
lapply(folds, function(z) mean(z <= n))
# Gives three values very close to 50% in each fold.
What is surprising: if the objective is not "reg:squarederror", then y is turned into a factor and stratification is done within each factor level. So if I run a Gamma (or Tweedie, Huber, Cox, ...) regression with 1 million distinct values, then the code creates overhead and can't stratify because each stratum has size 1.
if (params$objective != 'reg:squarederror')
  y <- factor(y)
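The problem is easy to demonstrate standalone in base R (this is an illustration of the effect, not xgboost's code): a continuous response has all-distinct values, so treating it as a factor yields one stratum per observation.

```r
# A continuous target, e.g. from a Gamma regression setting
set.seed(1)
y <- rgamma(1000, shape = 2)

# Turning it into a factor creates one level per distinct value ...
strata <- table(factor(y))

# ... so the largest stratum has size 1 and stratification is impossible.
max(strata)
```

With 1000 strata of size 1, the factor conversion only adds overhead and cannot balance anything across folds.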
I agree that my example above is too small to be useful in practice; I chose it to make the issue easy to see. I would still prefer folds that are as balanced as possible. Is it really so hard to fix this?
@hcho3 It would be great if we can allocate some time on reviewing the R package.
Dear xgboost developers, this is a follow-up to my issue #4509. I have upgraded to xgboost 1.0.0.2 and have checked the behaviour of the creation of stratified folds. The first problem I reported in #4509 has been fixed with #4631. Thank you very much for this. The second problem is still there. Please have a look at the following code:
library(xgboost)
# evaluation function that prints the number of actual positive instances
numpos <- function(pred, dtrain) {
  truth <- as.numeric(xgboost::getinfo(dtrain, 'label'))
  print(sum(truth))
  return(list(metric = 'numpos', value = sum(truth)))
}

features_train <- matrix(1:8, nrow = 8, ncol = 1)
targets_train <- c(0, 0, 0, 0, 1, 1, 1, 1)
dtrain <- xgb.DMatrix(features_train, label = targets_train)
param <- list(objective = "binary:logistic")

set.seed(314159)
xgb.cv(data = dtrain, nrounds = 1, nfold = 2, params = param,
       stratified = TRUE, feval = numpos, verbose = 2)
When I run this code, I get the following output:

[1] 3
[1] 1
[1] 1
[1] 3

This means that I get one fold with 3 positive labels and one fold with 1 positive label. With stratified sampling, I would expect 2 positive labels in every fold, i.e. the output:

[1] 2
[1] 2
[1] 2
[1] 2
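For comparison, the behaviour I would expect can be sketched in base R: assign fold ids within each class separately, so class counts per fold differ by at most one. This is a minimal illustration of the expected stratified behaviour, not xgboost's implementation; the function name stratified_folds is my own:

```r
# Within every class, fold ids 1..k are recycled and shuffled, so each
# fold receives an (almost) equal share of that class.
stratified_folds <- function(y, k, seed = 1) {
  set.seed(seed)
  fold <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  split(seq_along(y), fold)  # list of index vectors, one per fold
}

y <- c(0, 0, 0, 0, 1, 1, 1, 1)
folds <- stratified_folds(y, k = 2)
sapply(folds, function(i) sum(y[i]))
# each of the two folds contains exactly 2 positive labels
```

With 4 positives split across 2 folds, each fold always gets exactly 2, regardless of the seed.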