Hello, could anyone please address the above questions? Thanks much in advance!
Questions 1 and 4 above are answered in the following post: https://github.com/dmlc/xgboost/issues/364 Hence, closing this issue.
Hi @sjain777, did you understand the answer given in the above link? I could not, and I am having difficulty understanding how XGBoost actually predicts a probability (I am using the Python wrapper, but I think that is not really important).
Thanks!
Hi, I am using the following R package versions in R 3.3.2 (x64) for a binary classification problem: xgboost 0.6-4 and Matrix 1.2-6. My input data is originally categorical (levels: 0/1), which I convert to a sparse matrix format, since the data is very sparse, with thousands of such columns.
Upon building a Boosted Decision Tree model, I see some things I would like to understand better, to determine whether they are features of xgboost or whether I am doing something wrong in my data pre-processing and my call to xgb.train. I have appended my workflow on example data at the end; here are my questions:
1) The input features (columns) in my training data (TrainDcgMatrix) have values 1 or 0. Why is the split value -9.53674e-007 for all splits in the model, rather than 0, 1, or 0.5? Am I creating TrainDcgMatrix and Traindata correctly for a binary-classification problem with categorical data?
2) I turned off scientific notation with options(scipen = 999). Why do I still see the split value above in scientific notation? Scientific notation causes errors when parsing the PMML, so I would like to avoid it.
3) I see that the leaf scores are both positive and negative. Is the allowed range for leaf scores -1 to +1?
4) If we use the predict method from xgboost, we always get a final probability in [0, 1]. How do we combine the scores from multiple trees (as in the model dump) into the final predicted probability in the following cases?
a. num_parallel_tree = 1; nrounds = 100 (Boosted Decision Tree)
b. num_parallel_tree = 100; nrounds = 1 (Random Forest)
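To make the question concrete, here is a sketch of my current understanding, which I would like confirmed. My assumption (to be verified) is that for objective = "binary:logistic", the margin is the sum of the leaf scores of all trees in the dump plus the margin implied by base_score, and the predicted probability is the logistic transform of that margin:

# Sketch, assuming objective = "binary:logistic" and the default base_score = 0.5
# (whose contribution to the margin, qlogis(0.5), is 0); the leaf values are made up
predicted_prob <- function(leaf_scores, base_score = 0.5) {
  margin <- qlogis(base_score) + sum(leaf_scores)  # sum over all trees in the dump
  plogis(margin)                                   # 1 / (1 + exp(-margin))
}
predicted_prob(c(0.12, -0.05, 0.30))  # hypothetical leaf scores, one per tree

In particular, I would like to confirm whether case b sums the 100 parallel trees in the same way or averages them.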
Thanks much in advance for help on the above questions.
Example code
library(xgboost)
library(Matrix)
# Problem: data with 1000 unique RowId values, each having a set of grocery
# items purchased; set up a binary classification problem to predict purchase
# of "Carrots".

# List of items
groceryItems <- c("Eggs", "Soap_bars", "Plain_aspirin", "Squirty_soap", "Washing_tablets", "Dishwasher_tablets", "Mouthwash", "Tissues_box", "Tissues_packet", "Rosemary_dried", "Olive_oil", "Cheese", "Pepper", "Dishwasher_salt", "Cheese_Biscuits", "Yoghurts", "Margarine", "Freezer_bags", "Beans", "Milk", "Washing_tablets", "Kitchen_towel", "Gravy_granules", "Assorted_beans", "Onion", "Carrots", "Chicken", "Small_tin_of_salmon", "Crisps", "Weetabix", "Decaf_coffee_grounds", "Non_alcoholic_beer", "Milk", "Beer", "Margarine", "Cheese", "Salt", "Bread")
# Generate fake data
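# (Hypothetical reconstruction: the original data-generation code was not
# included in the post; illustrative values only.)
set.seed(100)
df <- data.frame(
  RowId = rep(1:1000, each = 10),
  Items = sample(groceryItems, 1000 * 10, replace = TRUE),
  stringsAsFactors = FALSE
)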
# Define the outcome for CLASSIFICATION
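# (Sketch, assumed from the problem statement: label a RowId 1 if it
# purchased the predictable item, 0 otherwise.)
Predictable <- "Carrots"
outcome <- as.integer(tapply(df$Items, df$RowId, function(x) Predictable %in% x))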
# Define input features
cFeatures <- setdiff(unique(df$Items), Predictable)
# Preprocess train data to SPARSE matrix format
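# (Sketch: one row per RowId, one 0/1 column per input feature; the exact
# original pre-processing was not included in the post.)
dd <- unique(df[df$Items %in% cFeatures, ])
TrainDcgMatrix <- sparseMatrix(
  i = match(dd$RowId, sort(unique(df$RowId))),
  j = match(dd$Items, cFeatures),
  x = 1,
  dims = c(length(unique(df$RowId)), length(cFeatures)),
  dimnames = list(NULL, cFeatures)
)
Traindata <- xgb.DMatrix(data = TrainDcgMatrix, label = outcome)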
# Boosted Decision Trees
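# (Sketch with illustrative parameters, not necessarily the original settings.)
param <- list(objective = "binary:logistic", eta = 0.3, max_depth = 6,
              num_parallel_tree = 1)
bst <- xgb.train(params = param, data = Traindata, nrounds = 100)
# Random Forest variant: num_parallel_tree = 100, nrounds = 1
head(xgb.dump(bst), 20)  # inspect split values and leaf scores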