dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

R xgboost Binary Classification: understanding model and scores #2052

Closed sjain777 closed 7 years ago

sjain777 commented 7 years ago

Hi, I am using the following versions of R packages in R 3.3.2 (x64): xgboost version 0.6-4 and Matrix version 1.2-6, for a binary classification problem. My input data is originally categorical (levels 0/1), which I convert to a sparse matrix format since the data is very sparse, with thousands of such columns.

Upon building a boosted decision tree model, I see some things I would like to understand better, so I can tell whether they are features of xgboost or whether I am doing something wrong in my data pre-processing or in the call to xgb.train.

I have appended my workflow on example data at the end; here are my questions:

1) The input features (columns) in my training data (TrainDcgMatrix) have values 1 or 0. Why is the split value -9.53674e-007 for all splits in the model, and not 0, 1, or 0.5? Am I creating TrainDcgMatrix and Traindata correctly for a binary-classification problem with categorical data? (Somewhere I read that in xgboost it is important to have the input features defined as "numeric", hence that is the current type of the columns in TrainDcgMatrix.)

2) I turned off scientific notation with options(scipen = 999). Why, then, do I still see the split value above in scientific notation? Scientific notation causes errors when parsing the PMML, so I would like to avoid it.

3) I see that the leaf scores are both positive and negative. Is the allowed range for leaf scores -1 to +1?

4) If we use the predict method from xgboost, we always get a final probability in [0, 1]. How do we combine the scores from the multiple trees (as in the model dump) into the final predicted probability for the following cases (see the sketch just below)?
   a. num_parallel_tree = 1; nrounds = 100 (boosted decision trees)
   b. num_parallel_tree = 100; nrounds = 1 (random forest)
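
My current guess for combining the scores, which I would like to confirm or correct, is sketched here. The leaf values are the first few from the model dump further down, and the summing plus logistic transform is my assumption, not something I found stated in the xgboost documentation:

# Guess: each boosted tree contributes the value of the leaf a row falls into;
# the contributions are summed into a raw log-odds score, and the predicted
# probability is the logistic transform of that sum.
leaf_values <- c(-0.158566, -0.138207, -0.126508)  # illustrative values taken from the dump below
raw_score <- sum(leaf_values)                       # summed margin (plus any global bias?)
1 / (1 + exp(-raw_score))                           # final probability in [0, 1]

For the random forest case (b) I am not sure whether the num_parallel_tree trees built in one round are combined by summing in the same way or by averaging.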

Thanks much in advance for help on the above questions.

Example code

library(xgboost)
library(Matrix)

Problem: data with 1000 unique RowIds, each with a set of grocery items purchased; set up a binary classification problem to predict the purchase of "Carrots".

List of items

groceryItems <- c("Eggs", "Soap_bars", "Plain_aspirin", "Squirty_soap", "Washing_tablets", "Dishwasher_tablets", "Mouthwash", "Tissues_box", "Tissues_packet", "Rosemary_dried", "Olive_oil", "Cheese", "Pepper", "Dishwasher_salt", "Cheese_Biscuits", "Yoghurts", "Margarine", "Freezer_bags", "Beans", "Milk", "Washing_tablets", "Kitchen_towel", "Gravy_granules", "Assorted_beans", "Onion", "Carrots", "Chicken", "Small_tin_of_salmon", "Crisps", "Weetabix", "Decaf_coffee_grounds", "Non_alcoholic_beer", "Milk", "Beer", "Margarine", "Cheese", "Salt", "Bread")

Generate fake data:

# Each RowId gets a random basket of 2 to 6 grocery items
for(irow in 1:1000)
{
  nitems <- sample(2:6, 1)
  items <- sample(groceryItems, nitems)

  if(irow == 1)
  {
    df <- data.frame(RowId = irow, Items = items)
  } else {
    temp <- data.frame(RowId = irow, Items = items)
    df <- rbind(df, temp)
  }
}
# Make sure for a given RowId, the Items are unique
df <- unique(df)

Define the outcome for CLASSIFICATION

Predictable <- "Carrots"
RowId_Pos <- unique(df[df$Items == Predictable, ]$RowId)

for(irow in 1:1000)
{  
  if(irow == 1) 
  { 
    dfOutcome <- data.frame(RowId = irow, Carrots = ifelse(irow %in% RowId_Pos, 1, 0))
  } else  {
    temp <- data.frame(RowId = irow, Carrots = ifelse(irow %in% RowId_Pos, 1, 0))
    dfOutcome <- rbind(dfOutcome, temp)
  }
}

Define input features

cFeatures <- setdiff(unique(df$Items), Predictable)

Preprocess train data to SPARSE matrix format

# Subset data to keep only RowIds with input features
dfInput <- df[df$Items %in% cFeatures, ]
# Re-level data after subsetting
dfInput$Items <- factor(dfInput$Items, levels = unique(dfInput$Items))

TrainDcgMatrix <- sparseMatrix(i = as.integer(as.factor(dfInput$RowId)), j = as.integer(as.factor(dfInput$Items)), x = 1)
rownames(TrainDcgMatrix) <- dfOutcome$RowId
colnames(TrainDcgMatrix) <- levels(dfInput$Items)

# get the outcome for label
outcome <- as.integer(dfOutcome[, Predictable])

Traindata <- xgb.DMatrix(TrainDcgMatrix, label=outcome)
set.seed(71) # for reproducibility of results
options(scipen = 999) # turn off scientific notation (since this causes problem with PMML reader)
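
As a sanity check related to question 1, I also verify that the sparse matrix rows line up with the labels (these are just my own base-R checks, not anything from xgboost):

# Every labelled RowId must contribute at least one feature row; otherwise the
# sparse matrix has fewer rows than there are labels and the rownames/label
# assignment above would be misaligned.
stopifnot(all(dfOutcome$RowId %in% dfInput$RowId))
stopifnot(nrow(TrainDcgMatrix) == nrow(dfOutcome))
# The row index comes from as.factor(dfInput$RowId), whose levels are sorted,
# so the matrix row order should match the increasing RowId order of dfOutcome.
stopifnot(all(sort(unique(dfInput$RowId)) == dfOutcome$RowId))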

Boosted Decision Trees

param <- list(silent = 1, eta = 0.1, max_depth = 6, num_parallel_tree = 1, colsample_bytree = 1, subsample = 1, objective = "binary:logistic", eval_metric = "auc")
ensModel <- xgb.train(param, data = Traindata, nrounds = 100,  save_period = NULL)

# Dump model
model <- xgb.dump(ensModel, with_stats = T)
model[1:20]

# [1] "booster[0]"                                                           
# [2] "0:leaf=-0.158566,cover=250"                                           
# [3] "booster[1]"                                                           
# [4] "0:[f3<-9.53674e-007] yes=1,no=2,missing=1,gain=0.179689,cover=248.435"
# [5] "1:leaf=-0.138207,cover=196.015"                                       
# [6] "2:leaf=-0.161276,cover=52.4198"                                       
# [7] "booster[2]"                                                           
# [8] "0:[f3<-9.53674e-007] yes=1,no=2,missing=1,gain=0.241786,cover=244.393"
# [9] "1:leaf=-0.126508,cover=192.97"                                        
# [10] "2:leaf=-0.148351,cover=51.4236"                                       
# [11] "booster[3]"                                                           
# [12] "0:[f3<-9.53674e-007] yes=1,no=2,missing=1,gain=0.331686,cover=238.634"
# [13] "1:leaf=-0.116639,cover=188.672"                                       
# [14] "2:leaf=-0.137838,cover=49.9617"                                       
# [15] "booster[4]"                                                           
# [16] "0:[f3<-9.53674e-007] yes=1,no=2,missing=1,gain=0.430625,cover=231.734"
# [17] "1:leaf=-0.108111,cover=183.545"                                       
# [18] "2:leaf=-0.129046,cover=48.1884"                                       
# [19] "booster[5]"                                                           
# [20] "0:[f3<-9.53674e-007] yes=1,no=2,missing=1,gain=0.529414,cover=224.129"
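
For questions 2 and 4, this is how I have been checking my understanding against the model above. predict(..., outputmargin = TRUE) should return the raw summed score before the logistic transform, and xgb.model.dt.tree should give the splits as plain numeric columns (my impression is that the xgb.dump text is written by the native library, so options(scipen) does not affect it, but please correct me if that is wrong):

# Raw scores vs. final probabilities: prob should be the logistic transform of margin
margin <- predict(ensModel, Traindata, outputmargin = TRUE)
prob <- predict(ensModel, Traindata)
all.equal(prob, 1 / (1 + exp(-margin)), tolerance = 1e-6)

# Splits as a data.table with numeric split columns, avoiding parsing of the dump text
treeTable <- xgb.model.dt.tree(feature_names = colnames(TrainDcgMatrix), model = ensModel)
head(treeTable)

This also makes me think the leaf values in question 3 are raw log-odds contributions rather than probabilities, so they would not be restricted to [-1, +1], but I would appreciate confirmation.
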
sjain777 commented 7 years ago

Hello, could anyone please address the above questions? Thanks much in advance!

sjain777 commented 7 years ago

Questions 1 and 4 above are answered in the following post: https://github.com/dmlc/xgboost/issues/364 Hence, I am closing this issue.

diziup commented 7 years ago

Hi @sjain777, did you understand the answer given in the link above? I could not, and I am having difficulties understanding how XGBoost actually predicts a probability (I am using the Python wrapper, but I think that's not really important).

Thanks!