dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[R] xgb.cv doesn't return feature names #5018

Open nettoyoussef opened 4 years ago

nettoyoussef commented 4 years ago

Hi all,

Long-time fan of your work on the XGBoost algorithm/implementation. It is super fast and memory-friendly.

I ran into a problem when trying to inspect feature importance with the xgb.cv function: it does not return the feature names when using the callback cb.cv.predict(save_models = TRUE).

I found this while trying to plot the model importance with xgb.plot.importance. Do the numbers refer to the Python way of counting columns (i.e., starting from 0)?

I made an MRE below:

Xgboost version: xgboost_0.90.0.2 (R package)

data(iris)
library(xgboost)
library(dplyr)

# drop one species so the task is binary
iris <- filter(iris, Species != 'setosa')
features <- as.matrix(iris[, !grepl('Species', colnames(iris))])
label <- ifelse(iris$Species == 'virginica', 1, 0)

model <- xgboost::xgb.cv(
                      data = features
                    , label = label
                    , nfold = 5
                    , nrounds = 25
                    , metrics = list("auc")
                    , stratified = TRUE
                    , verbose = TRUE
                    # save one booster per fold so they can be inspected later
                    , callbacks = list(cb.cv.predict(save_models = TRUE))
                    , params = list(
                        eta = 0.1
                        , max_depth = 10
                        , objective = "binary:logistic"
                        , colsample_bytree = 0.5
                        , subsample = 0.5
                        , nthread = 2
                        , seed = 1
                        )
                    )

# the fold boosters carry no feature names, so the plot falls back to indices
importance <- xgb.importance(model = model$models[[1]])
xgboost::xgb.plot.importance(importance)

[screenshot: xgboost_reprex]
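For the numbering question above: when the booster carries no feature names, xgb.importance falls back to generic identifiers (f0, f1, ... or bare 0-based indices, depending on the version), which are 0-based column positions, i.e. Python-style counting. A minimal sketch of mapping them back by hand, assuming the features matrix from the reprex:

# map the generic names back to the original columns of `features`;
# index N is 0-based, so it corresponds to column N + 1 in R
idx <- as.integer(sub("^f", "", importance$Feature))
importance$Feature <- colnames(features)[idx + 1]
xgboost::xgb.plot.importance(importance)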

nettoyoussef commented 4 years ago

There is a workaround: pass the feature names explicitly, as shown here:

importance <- xgb.importance(model = model$models[[1]], feature_names = colnames(features))

But I still think it would be advisable to fix the underlying problem.
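Since cb.cv.predict(save_models = TRUE) keeps one booster per fold, the same workaround applies to every fold model. A small sketch, reusing the objects from the reprex:

# importance tables for all fold models, with the names passed explicitly
importances <- lapply(
  model$models
  , function(m) xgb.importance(model = m, feature_names = colnames(features))
)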

trivialfis commented 2 years ago

Sorry for the delay.

Note to self: need to store the feature names in the booster after UBJSON support is merged. This line sets the feature names for the booster: https://github.com/dmlc/xgboost/blob/e94b76631035cd8b3a5cdd0c883225f069e74686/R-package/R/xgb.train.R#L380
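Until that lands, a hedged user-side stopgap is to attach the names to each fold booster directly, mirroring what that line in xgb.train does (this assumes the booster object exposes a feature_names list field, as in the 0.90-era R package):

# mirror xgb.train: store the column names on each booster itself
for (i in seq_along(model$models)) {
  model$models[[i]]$feature_names <- colnames(features)
}
importance <- xgb.importance(model = model$models[[1]])  # names now resolve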