dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

spark xgboost prediction accuracy and auc is much lower than that in log #2541

Closed GeorgeXia1828 closed 7 years ago

GeorgeXia1828 commented 7 years ago

I am training a Spark XGBoost model. The train-error in the log is about 0.28. I saved the model, then loaded it and evaluated it on the test set, and got a very bad AUC and accuracy (auc = 0.65, acc = 0.55). Given a train-error of 0.28, I would expect accuracy around 0.72 and an AUC much higher than 0.72. I also tried it on the train set and got the same result as on the test set. So I am confused: why is the accuracy different from what the log reports?

1. My trainModel code

import ml.dmlc.xgboost4j.scala.spark.{DataUtils, XGBoost}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.linalg.{DenseVector => MLDenseVector}
import org.apache.spark.ml.feature.{LabeledPoint => MLLabeledPoint}
import org.apache.spark.sql.SparkSession
import DataUtils._

object trainModel{
    def main(args: Array[String]){
        val spark = SparkSession.builder.appName("xiajizhong").getOrCreate()
        val sc = spark.sparkContext

        //val inputTrainPath = "/user/gulfstream/zhenpeng/xgboost/demo/data/agaricus.txt.train"
        //val inputTrainPath = "/user/bigdata_driver_ecosys_test/xiajizhong/base/201706_space"
        //val inputTestPath = "/user/gulfstream/zhenpeng/xgboost/demo/data/agaricus.txt.test"
        val inputTrainPath = args(0)

        // load libsvm training data and convert to ml LabeledPoints with dense feature vectors
        val trainRDD = MLUtils.loadLibSVMFile(sc, inputTrainPath).map(lp =>
            MLLabeledPoint(lp.label, new MLDenseVector(lp.features.toArray)))
        //val testSet = MLUtils.loadLibSVMFile(sc, inputTestPath).collect().map(
        //     lp => new MLDenseVector(lp.features.toArray)).iterator
        val paramMap = List(
              "eta" -> 0.5f,
              "max_depth" -> 6,
              "objective" -> "binary:logistic",
              "booster" -> "gbtree",
              "tree_method" -> "exact").toMap
        // train for 100 boosting rounds on 10 workers with external memory enabled
        val xgboostModel = XGBoost.train(trainRDD, paramMap, round=100, nWorkers=10, useExternalMemory=true)
        //xgboostModel.booster.predict(new DMatrix(testSet))
        //val outputModelPath = "/user/bigdata_driver_ecosys_test/xiajizhong/base/xgb_06.model"
        val outputModelPath = args(1)
        // persist the fitted model to HDFS
        xgboostModel.saveModelAsHadoopFile(outputModelPath)(sc)
    }
}

The log is:

2017-07-24 12:50:11,391-[TS] INFO Thread-50 ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - 2017-07-24 12:50:11,390 INFO [93] train-error:0.287650
2017-07-24 12:50:40,961-[TS] INFO Thread-50 ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - 2017-07-24 12:50:40,960 INFO [94] train-error:0.287634
2017-07-24 12:51:10,258-[TS] INFO Thread-50 ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - 2017-07-24 12:51:10,258 INFO [95] train-error:0.287631
2017-07-24 12:51:39,403-[TS] INFO Thread-50 ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - 2017-07-24 12:51:39,403 INFO [96] train-error:0.287623
2017-07-24 12:52:09,241-[TS] INFO Thread-50 ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - 2017-07-24 12:52:09,241 INFO [97] train-error:0.287612
2017-07-24 12:52:38,593-[TS] INFO Thread-50 ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - 2017-07-24 12:52:38,592 INFO [98] train-error:0.287607
2017-07-24 12:53:07,767-[TS] INFO Thread-50 ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - 2017-07-24 12:53:07,767 INFO [99] train-error:0.287586

2. My test model code

import org.apache.spark.{ SparkConf, SparkContext }
import ml.dmlc.xgboost4j.scala.spark.XGBoost
import org.apache.spark.sql.{ SparkSession, Row }
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors

object validation{
    def main(args: Array[String]){
        val spark = SparkSession.builder.appName("xiajizhong").getOrCreate()
        val sc = spark.sparkContext

        val inputTestPath = args(0)
        val modelPath = args(1)    
        val outputPredictPath = args(2)

        //val test = MLUtils.loadLibSVMFile(sc, inputTestPath).toDF("label", "features")
        // load the test set in libsvm format
        val test = spark.read.format("libsvm").load(inputTestPath).toDF("label", "features")
        // load the trained model from HDFS and score the test set
        val xgbModel = XGBoost.loadModelFromHadoopFile(modelPath)(sc)
        val predict = xgbModel.transform(test)
        predict.rdd.saveAsTextFile(outputPredictPath)
    }
}

3. Then I load the saved prediction result in PySpark to calculate the AUC, using BinaryClassificationMetrics (from pyspark.mllib.evaluation import BinaryClassificationMetrics). The resulting AUC (areaUnderROC) is only 0.65, and when I repeat this on the train set I get the same result!
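For reference, here is a minimal Scala sketch (not part of the original thread) that computes the AUC inside the same Spark job with BinaryClassificationMetrics on the predict DataFrame from the code above, instead of saving the predictions as text and re-parsing them in PySpark. The column names "probability" and "label" are assumptions; adjust them to whatever columns your xgboost4j-spark version's transform actually produces.

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// build (score, label) pairs; assumes a Double score column "probability" and a Double "label" column
val scoreAndLabel = predict.rdd.map(row =>
    (row.getAs[Double]("probability"), row.getAs[Double]("label")))
val metrics = new BinaryClassificationMetrics(scoreAndLabel)
println(s"areaUnderROC = ${metrics.areaUnderROC()}")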

defaultRobot commented 7 years ago

I also met this problem. Have you solved it?

ghost commented 6 years ago

Have you solved this problem?

GeorgeXia1828 commented 6 years ago

@hzliang It was just an index error: you train with libsvm feature indices starting from 1, but when you test with the Python model the indices start from 0. You can find this described on the XGBoost4J-Spark page on GitHub.
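For illustration, here is my own sketch (not from the thread) that shifts the 1-based libsvm indices of a test file down by one so the same data can be fed to the native/Python XGBoost libsvm reader, which treats indices as 0-based. The file names are placeholders, and it does not add the trailing padding feature mentioned in the next comment.

import scala.io.Source
import java.io.PrintWriter

// rewrite "label 1:v1 2:v2 ..." as "label 0:v1 1:v2 ..."
val in  = Source.fromFile("test_1based.libsvm")   // placeholder path
val out = new PrintWriter("test_0based.libsvm")   // placeholder path
for (line <- in.getLines(); if line.trim.nonEmpty) {
  val tokens = line.trim.split("\\s+")
  val shifted = tokens.tail.map { t =>
    val Array(idx, value) = t.split(":", 2)
    s"${idx.toInt - 1}:$value"
  }
  out.println((tokens.head +: shifted).mkString(" "))
}
out.close()
in.close()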

GeorgeXia1828 commented 6 years ago

The train-error in the training log is indeed right.

Call xgbModel.booster.saveModel("/local/path"); then you can use the saved model with the XGBoost Python API.

When predicting with the Python module, the features should be transformed from ... 1:xx_1,2:xx_2,3:xx_3 to 0:xx_1,1:xx_2,2:xx_3,3:0. Maybe this will solve the problem of moving a Spark-trained XGBoost model to the Python side.
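As a rough sketch of that workflow staying in Scala (paths are placeholders, and the libsvm file must already use the shifted 0-based indices described above), the locally saved booster can also be checked with the single-machine xgboost4j API:

import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost => XGBoostLocal}

// load the booster that was saved with xgbModel.booster.saveModel("/local/path")
val booster = XGBoostLocal.loadModel("/local/path")
val testMat = new DMatrix("/local/test_0based.libsvm")   // placeholder path, 0-based indices
// for binary:logistic each row yields one probability
val preds: Array[Array[Float]] = booster.predict(testMat)
preds.take(5).foreach(p => println(p.mkString(",")))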