dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Segmentation Fault in R under Linux #210

Closed larry77 closed 9 years ago

larry77 commented 9 years ago

I have a reproducible segmentation fault on my machine (running Debian) when I use the R bindings to xgboost. Please find the script at the end of the issue. The test and train data are taken from the Kaggle restaurant revenue competition:

http://www.kaggle.com/c/restaurant-revenue-prediction

I also attach the output of my sessionInfo()

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C            
 [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8   
 [5] LC_MONETARY=en_GB.utf8    LC_MESSAGES=en_GB.utf8  
 [7] LC_PAPER=en_GB.utf8       LC_NAME=C               
 [9] LC_ADDRESS=C              LC_TELEPHONE=C          
[11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C     

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] xgboost_0.3-4 Matrix_1.1-4

loaded via a namespace (and not attached):
 [1] chron_2.3-43        Ckmeans.1d.dp_3.3.1 colorspace_1.2-4  
 [4] curl_0.5            data.table_1.9.4    DiagrammeR_0.5    
 [7] digest_0.6.4        ggplot2_1.0.1       grid_3.1.1        
[10] gtable_0.1.2        htmltools_0.2.6     htmlwidgets_0.3.2 
[13] jsonlite_0.9.14     lattice_0.20-29     magrittr_1.5      
[16] MASS_7.3-34         munsell_0.4         plyr_1.8.1        
[19] proto_0.3-10        Rcpp_0.11.0         reshape2_1.4      
[22] RJSONIO_1.0-3       rstudioapi_0.2      scales_0.2.4      
[25] stringr_0.6.2       V8_0.5            

###############################################################################################

rm(list=ls())
library(Matrix)
library(xgboost)

## see http://bit.ly/1CjehL9

one_hot <- function(df){
    n <- nrow(df)
    nlevels <- sapply(df, nlevels)
    i <- rep(seq_len(n), ncol(df))
    j <- unlist(lapply(df, as.integer)) +
         rep(cumsum(c(0, head(nlevels, -1))), each = n)
    x <- 1
    res <- sparseMatrix(i = i, j = j, x = x)
    return(res)
}
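[Editor's note: a minimal standalone sanity check of the one_hot() helper above, on a hypothetical toy data frame where every column is a factor (which the function requires). The function is repeated here so the snippet runs on its own.]

```r
library(Matrix)

# Same helper as in the report, repeated so this snippet is self-contained.
# It assumes every column of df is a factor; the level-offset arithmetic
# breaks for numeric columns (nlevels() returns 0 for them).
one_hot <- function(df){
    n <- nrow(df)
    nlevels <- sapply(df, nlevels)
    i <- rep(seq_len(n), ncol(df))
    j <- unlist(lapply(df, as.integer)) +
         rep(cumsum(c(0, head(nlevels, -1))), each = n)
    sparseMatrix(i = i, j = j, x = 1)
}

# Toy input: 3 rows, two factor columns with 2 levels each.
toy <- data.frame(size  = factor(c("S", "L", "S")),
                  color = factor(c("red", "red", "blue")))
m <- one_hot(toy)
dim(m)       # 3 rows, 4 columns (2 levels + 2 levels)
rowSums(m)   # each row has one 1 per original column, so 2 per row
```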

################################################################

cv.nround <- 100
nfold <- 4

train<-read.csv("train.csv", header=T,stringsAsFactors = T )
test<-read.csv("test.csv", header=T,stringsAsFactors = T )

y <- train$revenue

## Combine train and test

## (in order to do so, remove the revenue value from the train set)

train <- subset(train, select=-c(revenue))

x <- rbind(test,train)

#remove the id value which has no meaning

x <- subset(x, select=-c(Id))

#Change the date into years and months (as factors!)

date <- as.character(x$Open.Date)
date2 <- strptime(date, format="%m/%d/%Y")

year <- as.POSIXlt(date2)$year + 1900
month <- as.POSIXlt(date2)$mon + 1

x <- subset(x, select=-c(Open.Date))

x$year <- year
x$month <- month

x$month <- as.factor(x$month)
x$year <- as.factor(x$year)

## apply one-hot encoding

x <- one_hot(x)

trind = 1:length(y) ## rows from the training dataset in x
teind = (nrow(train)+1):nrow(x) ## rows from the test dataset in x

# Set necessary parameters
param <- list("objective" = "reg:linear",
              "max_depth"=6,
              "eta"=0.1,
              "subsample"=1,
              "gamma"=1,
               "min_child_weight"=1,
              "eval_metric" = "mlogloss",
              "silent"=1,
              "num_class" = 9,
              "nthread" = 6)

bst.cv = xgb.cv(param=param, data = x[trind,], label = y,
                nfold = nfold, nrounds=cv.nround)
hetong007 commented 9 years ago

Thanks for your report. It seems to me the crash is caused by a mismatch between the "objective", reg:linear, and the "eval_metric", mlogloss. Simply removing "eval_metric" should solve this problem.

For the details about evaluation metric and objective, please visit our wiki page: https://github.com/dmlc/xgboost/wiki/Parameters . We are just about to add more checks on the parameters to avoid these crashes.
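[Editor's note: a sketch of the fix described above, under the assumption that the reporter wants plain regression. The multi-class options ("eval_metric" = "mlogloss" and "num_class") are dropped, and "rmse", xgboost's default metric for reg:linear, is set explicitly.]

```r
# Corrected parameter list for a regression objective: no mlogloss,
# no num_class. These only make sense for multi-class classification.
param <- list(objective        = "reg:linear",
              max_depth        = 6,
              eta              = 0.1,
              subsample        = 1,
              gamma            = 1,
              min_child_weight = 1,
              eval_metric      = "rmse",  # matches the regression objective
              silent           = 1,
              nthread          = 6)
```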

tqchen commented 9 years ago

I am closing this issue since this was due to a mismatch between the metric and the loss.

larry77 commented 9 years ago

Hi, well, I agree that that was the problem, but in such a case it would be nice to get an error message from xgboost rather than a segmentation fault. Cheers

Lorenzo


pommedeterresautee commented 9 years ago

@tqchen do you think the check should be done at R / Python level or in C code?

tqchen commented 9 years ago

I totally agree that an error message should be shown instead of a crash. I have pushed a fix for this. Thanks!

gmckinnon commented 9 years ago

Does anyone else get this error message when running this:

Error in xgb.iter.eval(fd$booster, fd$watchlist, i - 1, feval) : label and prediction size not match, hint: use merror or mlogloss for multi-class classification

My code is similar to this and I am getting the exact same error.

sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C LC_TIME=English_Australia.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] dplyr_0.4.1 ade4_1.6-2 eeptools_0.3.1 MASS_7.3-39 caret_6.0-41 ggplot2_1.0.1
 [7] lattice_0.20-30 Boruta_4.0.0 rFerns_1.1.0 randomForest_4.6-10 xgboost_0.3-4 Matrix_1.2-0

loaded via a namespace (and not attached):
 [1] abind_1.4-3 arm_1.7-07 assertthat_0.1 BradleyTerry2_1.0-6 brglm_0.5-9 car_2.0-25
 [7] chron_2.3-45 Ckmeans.1d.dp_3.3.1 coda_0.17-1 codetools_0.2-10 colorspace_1.2-6 curl_0.5
[13] data.table_1.9.4 DBI_0.3.1 DiagrammeR_0.5 digest_0.6.8 foreach_1.4.2 foreign_0.8-63
[19] grid_3.1.3 gtable_0.1.2 gtools_3.4.1 htmltools_0.2.6 htmlwidgets_0.3.2 iterators_1.0.7
[25] jsonlite_0.9.14 lme4_1.1-7 magrittr_1.5 maptools_0.8-34 memisc_0.97 mgcv_1.8-4
[31] minqa_1.2.4 munsell_0.4.2 nlme_3.1-120 nloptr_1.0.4 nnet_7.3-9 parallel_3.1.3
[37] pbkrtest_0.4-2 plyr_1.8.1 proto_0.3-10 quantreg_5.11 Rcpp_0.11.5 reshape2_1.4.1
[43] RJSONIO_1.3-0 rstudioapi_0.2 scales_0.2.4 sp_1.0-17 SparseM_1.6 splines_3.1.3
[49] stringr_0.6.2 tools_3.1.3 V8_0.5

tqchen commented 9 years ago

See the hint: it is likely you set the wrong evaluation metric. Use mlogloss or merror for evaluating multi-class classification instead.

tqchen commented 9 years ago

Interesting. Can you check whether nrow(x) and length(y) match each other? If you feel this is a bug, you can submit a script with simulated data to reproduce the problem.

Thanks
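[Editor's note: a minimal sketch of the check suggested above, using hypothetical stand-ins for the reporter's x matrix and y label vector. With x built as rbind(test, train), the slice of x passed to xgb.cv must have exactly as many rows as y has labels.]

```r
# Toy stand-ins (hypothetical): 10 combined rows, 7 training labels.
x <- matrix(0, nrow = 10, ncol = 3)
y <- rnorm(7)

trind <- seq_along(y)  # indices of the rows paired with labels

# The slice handed to xgb.cv must match the label length, otherwise
# xgboost reports "label and prediction size not match":
nrow(x[trind, , drop = FALSE]) == length(y)  # TRUE here
```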

On Sat, Apr 11, 2015 at 8:45 PM, gmckinnon notifications@github.com wrote:

I understand that, but in this case it's a regression problem; my y labels are numeric values, so using the reg:linear objective and the default metric it should be fine, but I get the above error.

I'll attach my console output for x and y.

[image: image] https://cloud.githubusercontent.com/assets/7494314/7104311/2478d790-e11a-11e4-8ed4-b1c2553a1f7b.png


Sincerely,

Tianqi Chen Computer Science & Engineering, University of Washington

gmckinnon commented 9 years ago

All good, I fixed it. The above script uses "num_class" = 9 when this is a regression problem.