dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[R] bug in creation of stratified folds #5670

Open JaakobKind opened 4 years ago

JaakobKind commented 4 years ago

Dear xgboost developers, this is a follow-up to my issue #4509. I have upgraded to xgboost 1.0.0.2 and have checked the behaviour of the creation of stratified folds. The first problem I reported in #4509 has been fixed with #4631, thank you very much for this. The second problem is still there. Please have a look at the following code:

library(xgboost)

# evaluation function that prints the number of actual positive instances
numpos <- function(pred, dtrain) {
  truth <- as.numeric(xgboost::getinfo(dtrain, 'label'))
  print(sum(truth))
  return(list(metric = 'numpos', value = sum(truth)))
}

features_train <- matrix(1:8, nrow = 8, ncol = 1)
targets_train <- c(0, 0, 0, 0, 1, 1, 1, 1)
dtrain <- xgb.DMatrix(features_train, label = targets_train)
param <- list(objective = "binary:logistic")

set.seed(314159)
xgb.cv(data = dtrain, nrounds = 1, nfold = 2, params = param,
       stratified = TRUE, feval = numpos, verbose = 2)

When I run this code, I get the following output:

[1] 3
[1] 1
[1] 1
[1] 3

This means that I get one fold with 3 positive labels and one fold with 1 positive label. With stratified sampling, I would expect 2 positive labels in every fold, i.e. the output

[1] 2
[1] 2
[1] 2
[1] 2

trivialfis commented 4 years ago

Are there any "de facto" functions for doing CV in R?

trivialfis commented 4 years ago

I'm asking because if there are, we could provide tight integration with them and wouldn't have to maintain one ourselves. I'm sure a de facto implementation would do a better job than XGBoost's own.

t-wojciech commented 4 years ago

I don't think there is a pure CV function in R. Usually separate packages like mlr3 or rsample are used for CV. Here you can also find solutions from other packages.

Maybe @jameslamb will say something more about it?
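
For illustration, a minimal sketch of stratified folds with rsample (the toy data below is hypothetical, mirroring the OP's example; with only 8 rows rsample may warn and pool small strata):

library(rsample)

# toy data mirroring the OP's example
df <- data.frame(x = 1:8, label = factor(c(0, 0, 0, 0, 1, 1, 1, 1)))

set.seed(314159)
folds <- vfold_cv(df, v = 2, strata = label)

# class counts in each assessment (hold-out) split
lapply(folds$splits, function(s) table(assessment(s)$label))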

jameslamb commented 4 years ago

I'm not aware of a lightweight library that is the "de facto" choice for doing cross validation in R, equivalent to GridSearchCV from scikit-learn in Python.

The closest might be the CV options available from {caret} (like in this blog post and this other blog post), but it's not as clear a winner in the R world as scikit-learn is in Python.
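
For illustration, a hedged sketch of {caret}'s stratified fold helper, createFolds(); the commented train() call shows roughly where xgboost would plug in (X and y are placeholder names, not from this thread):

library(caret)

y <- factor(c(0, 0, 0, 0, 1, 1, 1, 1))
folds <- createFolds(y, k = 2)            # hold-out indices per fold, stratified by y
lapply(folds, function(idx) table(y[idx]))

# full tuning loop (sketch only):
# ctrl <- trainControl(method = "cv", number = 5)
# fit  <- train(x = X, y = y, method = "xgbTree", trControl = ctrl)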

In LightGBM we ended up implementing our own (https://github.com/microsoft/LightGBM/blob/1f3e72c43ca8485eeba988738ecb0e977c7977f1/R-package/R/lgb.cv.R).

terrytangyuan commented 4 years ago

This one is part of tidymodels, which is a large community of modeling packages in R: https://github.com/tidymodels/tune

mayer79 commented 4 years ago

In order to get fold indices, I have recently put a lightweight R package without any dependencies on CRAN for this purpose: splitTools.

It supports different split types (including stratified splits), has a flexible output interface, and is optimized for speed. Maybe it would be an alternative to heavier dependencies such as caret.

In the OP's situation:

library(splitTools)

targets_train = c(0,0,0,0,1,1,1,1)
folds <- create_folds(targets_train, k = 2, type = "stratified", seed = 1)

for (fold in folds) {
  cat("\n\nIndices:", fold)
  cat("\nValues:", targets_train[fold])
}

# Results
Indices: 2 3 6 7
Values: 0 0 1 1

Indices: 1 4 5 8
Values: 0 0 1 1

A technical problem: with stratified splitting, regression labels should be treated differently from classification labels. For regression labels, quantile binning is usually done first, while for classification no preprocessing is required. Since XGBoost's response is numeric for both regression and classification tasks, this can lead to ambiguous situations. The logic in splitTools is to first check the number of distinct values in the response: if it is larger than n_bins = 10, it does quantile binning; otherwise, it does no binning. The technical problem could be solved by using the number-of-classes info from XGB, e.g. setting n_bins = max(10, num_classes).
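
For illustration, a small sketch of that binning behaviour on a continuous response (the Gamma-distributed target is made up for the example; n_bins is the argument described above):

library(splitTools)

set.seed(1)
y_reg <- rgamma(500, shape = 2)   # continuous response with many distinct values
folds <- create_folds(y_reg, k = 5, type = "stratified", n_bins = 10)

# thanks to quantile binning, the fold-wise response means should be close
sapply(folds, function(idx) mean(y_reg[idx]))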

mayer79 commented 4 years ago

Next weekend, I will look into how XGB currently does the splitting. It is probably just a small fix.

mayer79 commented 4 years ago

I had a look at the R code in "utils.R". Small samples and stratification are generally a bad combination: if one fixes one problem, the next will pop up.

For not-too-small n, the imbalance is not a problem:

# xgb.createFolds() is an internal helper defined in the R package's utils.R
n <- 100
y <- rep(0:1, each = n)                      # 100 zeros followed by 100 ones
folds <- xgboost:::xgb.createFolds(y, 3)
lapply(folds, function(z) mean(z <= n))      # share of class 0 in each fold

# Gives three values very close to 50% in each fold.

What is surprising: if the objective is not "reg:squarederror", then y is turned into a factor and stratification is done within each factor level. So if I run a Gamma (or Tweedie, Huber, Cox, ...) regression with 1 million distinct values, the code creates overhead and can't stratify because each stratum has size 1.

 if (params$objective != 'reg:squarederror')
        y <- factor(y)
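
For illustration, a small sketch of why this is problematic for a continuous response (the Gamma-distributed target is made up for the example):

# factor(y) turns (almost) every observation into its own stratum, so
# stratification is meaningless and the conversion only adds overhead
set.seed(1)
y <- rgamma(1e5, shape = 2)   # e.g. a Gamma regression target
length(unique(y))             # ~100000 distinct values
table(table(factor(y)))       # nearly every stratum has size 1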
JaakobKind commented 3 years ago

I agree that my example above is too small to be useful; I just chose a small example to make the issue easier to see. I would still prefer folds that are as balanced as possible. Is it really so hard to fix this?
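
For illustration, one hedged sketch (not xgboost's implementation) of how maximally balanced folds could be built for this toy example: shuffle each class's indices and deal them out round-robin, so per-fold class counts differ by at most one.

targets_train <- c(0, 0, 0, 0, 1, 1, 1, 1)
k <- 2
folds <- vector("list", k)
set.seed(314159)
for (cls in unique(targets_train)) {
  idx <- sample(which(targets_train == cls))       # shuffle within class
  fold_id <- rep_len(seq_len(k), length(idx))      # round-robin fold assignment
  for (f in seq_len(k)) folds[[f]] <- c(folds[[f]], idx[fold_id == f])
}
sapply(folds, function(idx) sum(targets_train[idx]))  # 2 positives in each fold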

trivialfis commented 3 years ago

@hcho3 It would be great if we could allocate some time to reviewing the R package.