dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Predictions from R and Python don't match #1623

Closed: hafen closed this issue 8 years ago

hafen commented 8 years ago

I am trying to build a prediction function in R for a model that was trained in Python, and the predictions from the two languages don't match. See below for an example of how to reproduce.

I am fairly new to xgboost, particularly to using it across languages, so I may be missing something obvious.
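For context, a minimal sketch of how such a model file is typically produced on the Python side (the training data, parameters, and round count here are placeholders; only the file name comes from this issue):

    import numpy as np
    import xgboost as xgb

    # hypothetical training data with 37 features, matching the vector below
    X = np.random.rand(100, 37)
    y = np.random.randint(2, size=100)
    dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
    # placeholder parameters; the real model's settings are not shown in this issue
    bst = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=10)
    # serialize to the binary format that R's xgb.load() can read
    bst.save_model('WTKG.model')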

Environment info

Operating System: macOS 10.12

Compiler: gcc

Package used (python/R/jvm/C++): Python, R

xgboost version used: 0.6

If you are using python package, please provide

  1. The python version and distribution: Python 2.7.12 :: Anaconda 2.1.0 (x86_64)
  2. The command to install xgboost if you are not installing from source: Installed from source

If you are using R package, please provide

  1. The R sessionInfo()

    > sessionInfo()
    R version 3.3.1 (2016-06-21)
    Platform: x86_64-apple-darwin13.4.0 (64-bit)
    Running under: OS X 10.12 (Sierra)
    
    locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
    [1] xgboost_0.6  colorout_1.1-0
    
    loaded via a namespace (and not attached):
    [1] magrittr_1.5     Matrix_1.2-6     tools_3.3.1      stringi_1.1.1   
    [5] grid_3.3.1       data.table_1.9.6 stringr_1.1.0    chron_2.3-47    
    [9] lattice_0.20-33 
  2. The command to install xgboost if you are not installing from source: Installed from source

Steps to reproduce

  1. Download this model file: http://ml.stat.purdue.edu/hafen/WTKG.model
  2. Run this script in R:

    library(xgboost)
    # load the model that was trained and saved in Python
    mod <- xgb.load("WTKG.model")
    # one observation; NA marks missing feature values
    x <- c(91, 9, 9, NA, NA, 273, 20, 170, NA, NA, 14, 14, 0,
     2, 0.94289404091, 0.94289404091, 0.93087973569, 0.0120143052199997, 0.95490834613,
     0.95490834613, 1, 90, 0.95490834613, 1, 90,
     0.93087973569, 357, -266, 0.93087973569, 357, -266,
     0.95490834613, NA, 0.93087973569, NA, NA, NA)
    # build a one-row DMatrix, treating NA as missing
    d <- xgb.DMatrix(matrix(x, nrow = 1), missing = NA)
    predict(mod, d)
    # [1] 0.6483372
  3. Run this script in Python:

    import numpy as np
    import xgboost as xgb
    
    # load the same model file in Python
    bst = xgb.Booster({'nthread': 4})
    bst.load_model('WTKG.model')
    # the same observation as in R; np.nan marks missing feature values
    x = [91, 9, 9, np.nan, np.nan, 273, 20, 170, np.nan, np.nan, 14, 14, 0,
     2, 0.94289404091, 0.94289404091, 0.93087973569, 0.0120143052199997, 0.95490834613,
     0.95490834613, 1, 90, 0.95490834613, 1, 90,
     0.93087973569, 357, -266, 0.93087973569, 357, -266,
     0.95490834613, np.nan, 0.93087973569, np.nan, np.nan, np.nan]
    # build the DMatrix from a plain list of lists
    d = xgb.DMatrix(data=[x], missing=np.nan)
    bst.predict(d)[0]
    # 1.3775804
hafen commented 8 years ago

Never mind - it looks like the issue was on the Python end: [x] should be wrapped in np.array([x]).
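For completeness, a sketch of the corrected script with that fix applied (only the DMatrix construction changes; per the resolution above, the prediction should then agree with R's 0.6483372):

    import numpy as np
    import xgboost as xgb

    bst = xgb.Booster({'nthread': 4})
    bst.load_model('WTKG.model')
    x = [91, 9, 9, np.nan, np.nan, 273, 20, 170, np.nan, np.nan, 14, 14, 0,
     2, 0.94289404091, 0.94289404091, 0.93087973569, 0.0120143052199997, 0.95490834613,
     0.95490834613, 1, 90, 0.95490834613, 1, 90,
     0.93087973569, 357, -266, 0.93087973569, 357, -266,
     0.95490834613, np.nan, 0.93087973569, np.nan, np.nan, np.nan]
    # wrap the list in np.array so DMatrix receives a proper 2-D numeric array
    d = xgb.DMatrix(data=np.array([x]), missing=np.nan)
    bst.predict(d)[0]
    # expected to match the R output: 0.6483372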