dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[jvm-packages] what(): [06:24:26] /xgboost/src/tree/updater_histmaker.cc:307: fv=inf, hist.last=inf, while trying to run xgboost in scala spark #4977

Closed: soumalya-hue closed this issue 5 years ago

soumalya-hue commented 5 years ago
1. The error I am facing is as follows:

```
scala> val xgBoostModelWithDF = XGBoost.trainWithDataFrame(trainingData, paramMap, round = numRound, nWorkers = numWorkers, useExternalMemory = true)
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=172.16.1.241, DMLC_TRACKER_PORT=9092, DMLC_NUM_WORKER=1}
[Stage 11:> (0 + 1) / 1]
19/10/21 06:24:28 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_20_1 !
19/10/21 06:24:28 ERROR YarnScheduler: Lost executor 6 on ip-172-16-1-197.ap-south-1.compute.internal: Container marked as failed: container_e11_1571576194248_0016_02_000003 on host: ip-172-16-1-197.ap-south-1.compute.internal. Exit status: 134. Diagnostics: 4248_0016/container_e11_1571576194248_0016_02_000003/stderr
Last 4096 bytes of stderr:
: hist[55]=40.6738
[06:24:26] /xgboost/src/tree/updater_histmaker.cc:305: hist[56]=49.8571
[06:24:26] /xgboost/src/tree/updater_histmaker.cc:305: hist[57]=73.2086
[06:24:26] /xgboost/src/tree/updater_histmaker.cc:305: hist[58]=97.1429
[06:24:26] /xgboost/src/tree/updater_histmaker.cc:305: hist[59]=176.245
[06:24:26] /xgboost/src/tree/updater_histmaker.cc:305: hist[60]=224.643
[06:24:26] /xgboost/src/tree/updater_histmaker.cc:305: hist[61]=691.417
[06:24:26] /xgboost/src/tree/updater_histmaker.cc:305: hist[62]=inf
[06:24:26] /xgboost/src/tree/updater_histmaker.cc:305: hist[63]=inf
terminate called after throwing an instance of 'dmlc::Error'
  what():  [06:24:26] /xgboost/src/tree/updater_histmaker.cc:307: fv=inf, hist.last=inf

Stack trace returned 10 entries:
[bt] (0) /hadoop/yarn/local/usercache/spark/appcache/application_1571576194248_0016/container_e11_1571576194248_0016_02_000003/tmp/libxgboost4j2340608630152808150.so(dmlc::StackTrace()+0x19d) [0x7f00924af4dd]
[bt] (1) /hadoop/yarn/local/usercache/spark/appcache/application_1571576194248_0016/container_e11_1571576194248_0016_02_000003/tmp/libxgboost4j2340608630152808150.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x18) [0x7f00924b0388]
[bt] (2) /hadoop/yarn/local/usercache/spark/appcache/application_1571576194248_0016/container_e11_1571576194248_0016_02_000003/tmp/libxgboost4j2340608630152808150.so(xgboost::tree::CQHistMaker::HistEntry::Add(float, xgboost::detail::GradientPairInternal)+0x4ef) [0x7f0092610dcf]
[bt] (3) /hadoop/yarn/local/usercache/spark/appcache/application_1571576194248_0016/container_e11_1571576194248_0016_02_000003/tmp/libxgboost4j2340608630152808150.so(+0x1e829a) [0x7f009260829a]
[bt] (4) /hadoop/yarn/local/usercache/spark/appcache/application_1571576194248_0016/container_e11_1571576194248_0016_02_000003/tmp/libxgboost4j2340608630152808150.so(+0x1e8ac7) [0x7f0092608ac7]
[bt] (5) /hadoop/yarn/local/usercache/spark/appcache/application_1571576194248_0016/container_e11_1571576194248_0016_02_000003/tmp/libxgboost4j2340608630152808150.so(xgboost::tree::GlobalProposalHistMaker::CreateHist(std::vector<xgboost::detail::GradientPairInternal, std::allocator<xgboost::detail::GradientPairInternal > > const&, xgboost::DMatrix, std::vector<unsigned int, std::allocator > const&, xgboost::RegTree const&)+0x749) [0x7f0092616099]
[bt] (6) /hadoop/yarn/local/usercache/spark/appcache/application_1571576194248_0016/container_e11_1571576194248_0016_02_000003/tmp/libxgboost4j2340608630152808150.so(xgboost::tree::HistMaker::Update(std::vector<xgboost::detail::GradientPairInternal, std::allocator<xgboost::detail::GradientPairInternal > > const&, xgboost::DMatrix, xgboost::RegTree)+0x2d9) [0x7f00926183e9]
[bt] (7) /hadoop/yarn/local/usercache/spark/appcache/application_1571576194248_0016/container_e11_1571576194248_0016_02_000003/tmp/libxgboost4j2340608630152808150.so(xgboost::tree::HistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, std::vector<xgboost::RegTree, std::allocator<xgboost::RegTree> > const&)+0xa3) [0x7f009260a973]
[bt] (8) /hadoop/yarn/local/usercache/spark/appcache/application_1571576194248_0016/container_e11_1571576194248_0016_02_000003/tmp/libxgboost4j2340608630152808150.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete > > >)+0x80b) [0x7f009253a4fb]
[bt] (9) /hadoop/yarn/local/usercache/spark/appcache/application_1571576194248_0016/container_e11_1571576194248_0016_02_000003/tmp/libxgboost4j2340608630152808150.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::ObjFunction*)+0x825) [0x7f009253af35]
```

  2. We are using Spark version 2.3.2.3.1.0.0-78 and the XGBoost jars `/home/ubuntu/xgboost4j-0.72.jar` and `/home/ubuntu/xgboost4j-spark-0.72.jar`.

The code is as follows. To get into the Spark shell I use:

```
spark-shell --jars /home/ubuntu/xgboost4j-0.72.jar,/home/ubuntu/xgboost4j-spark-0.72.jar
```

The current code that I am using is:

```scala
import java.util.Calendar
import org.apache.log4j.{Level, Logger}
import ml.dmlc.xgboost4j.scala.spark.XGBoost
import org.apache.spark.ml.feature._
import org.apache.spark.sql._
import org.apache.spark.sql.functions.lit

val now = Calendar.getInstance()
val date = java.time.LocalDate.now
val currentHour = now.get(Calendar.HOUR_OF_DAY)
val currentMinute = now.get(Calendar.MINUTE)
val direct = "./results/" + date + "-" + currentHour + "-" + currentMinute + "/"
println(direct)

val dataset = spark.read.option("header", "true").option("inferSchema", true).csv("/pyspark_model_run/Existing_train_data_Prescoring_mod.csv")
val datatest = spark.read.option("header", "true").option("inferSchema", true).csv("/pyspark_model_run/Existing_test_data_Prescoring_mod.csv")

dataset.cache()
datatest.cache()

val df = dataset.na.fill(0).sample(true, 0.7, 10)
val df_test = datatest.na.fill(0)

val header = df.columns.filter(!_.contains("train_dataMSISDN")).filter(!_.contains("labels"))
val assembler = new VectorAssembler().setInputCols(header).setOutputCol("features")

val train_DF0 = assembler.transform(df)
val test_DF0 = assembler.transform(df_test)

println("VectorAssembler Done!")

val train = train_DF0.withColumn("label", df("labels").cast("double")).select("label", "features")
// val test = test_DF0.withColumn("label", lit(1.0)).withColumnRenamed("Id", "id").select("id", "label", "features")
val test = test_DF0.withColumn("label", df_test("ts_label").cast("double")).withColumn("test_data_MSISDN", df_test("test_data_MSISDN").cast("double")).select("test_data_MSISDN", "label", "features")

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = train.randomSplit(Array(0.7, 0.3), seed = 0)

println(trainingData.count)

// number of iterations
val numRound = 2
val numWorkers = 1

// training parameters
val paramMap = List(
  "eta" -> 0.01,
  "max_depth" -> 5,
  "min_child_weight" -> 1,
  "subsample" -> 1,
  "colsample_bytree" -> 1,
  "eval_metric" -> "auc",
  "objective" -> "binary:logistic").toMap

println("Starting Xgboost ")

val xgBoostModelWithDF = XGBoost.trainWithDataFrame(trainingData, paramMap, round = numRound, nWorkers = numWorkers, useExternalMemory = true)
```

  3. I am also attaching the data, with the MSISDN field (which is just the ID field) removed, the features renamed to X1, X2, ..., X30, and the Y/label field renamed to 'label': Existing_train_data_Prescoring_mod.zip
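Since the crash message points at an infinite feature value (fv=inf), one thing worth checking independently of the XGBoost version is whether the input frame contains infinities, e.g. introduced by the inferred-schema CSV load. A minimal diagnostic sketch, assuming the df built in the code above and standard Spark SQL functions:

```scala
import org.apache.spark.sql.functions.{col, count, when}
import org.apache.spark.sql.types.{DoubleType, FloatType}

// Count +/-Infinity occurrences per floating-point column of df.
val numericCols = df.schema.fields
  .filter(f => f.dataType == DoubleType || f.dataType == FloatType)
  .map(_.name)

val infCounts = df.select(numericCols.map { c =>
  count(when(col(c) === Double.PositiveInfinity || col(c) === Double.NegativeInfinity, c)).alias(c)
}: _*)

infCounts.show(false)
```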
soumalya-hue commented 5 years ago

Could you please help me with this issue as soon as possible? Our production model deployment is blocked by it. The issue does not occur when we set numRound = 1, but as soon as we change to numRound = 2 we get this error.

hcho3 commented 5 years ago

Did you try using the latest version (0.90)?

soumalya-hue commented 5 years ago

No. Can you tell me whether it is compatible with Spark 2.3.2.3.1.0.0-78, and which jars I should use if I am downloading the XGBoost jars manually?

hcho3 commented 5 years ago

Actually, 0.90 requires Spark 2.4. You should use 0.82 instead.
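For reference, the 0.8x releases replaced the DataFrame helper XGBoost.trainWithDataFrame with Spark ML-style estimators, so moving to 0.82 also means updating the training call. A minimal sketch, assuming the 0.82 XGBoostClassifier API and the label/features columns built in the code above:

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Roughly the same parameters as the paramMap above; in the estimator API
// num_round and num_workers are passed through the parameter map.
val xgbParam = Map(
  "eta" -> 0.01,
  "max_depth" -> 5,
  "min_child_weight" -> 1,
  "subsample" -> 1,
  "colsample_bytree" -> 1,
  "eval_metric" -> "auc",
  "objective" -> "binary:logistic",
  "num_round" -> 2,
  "num_workers" -> 1)

val xgbClassifier = new XGBoostClassifier(xgbParam)
  .setFeaturesCol("features")
  .setLabelCol("label")

val model = xgbClassifier.fit(trainingData)
```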

soumalya-hue commented 5 years ago

Can you provide me a guiding link to install 0.82 xgboost , so that I can run a similar code in Spark 2.3.2.3.1.0.0-78?

hcho3 commented 5 years ago

According to https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#refer-to-xgboost4j-spark-dependency, you can specify XGBoost4J-Spark as a dependency in pom.xml. In general, all recent versions of XGBoost4J-Spark are available on Maven Central.
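For a project build (rather than spark-shell --jars), a minimal sketch of the equivalent sbt declaration, assuming the ml.dmlc:xgboost4j-spark:0.82 coordinates published on Maven Central:

```scala
// build.sbt sketch: pull XGBoost4J and XGBoost4J-Spark 0.82 from Maven Central
libraryDependencies ++= Seq(
  "ml.dmlc" % "xgboost4j"       % "0.82",
  "ml.dmlc" % "xgboost4j-spark" % "0.82"
)
```

For a quick interactive test, the same coordinates can also be passed to spark-shell via --packages instead of pointing --jars at manually downloaded files.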