dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[jvm-packages] java.lang.ArrayStoreException: [Lml.dmlc.xgboost4j.LabeledPoint; #3092

Closed · gorkemozkaya closed 6 years ago

gorkemozkaya commented 6 years ago

For bugs or installation issues, please provide the following information. The more information you provide, the more easily we will be able to offer help and advice.

Environment info

Operating System: Linux

Compiler: g++

Steps to reproduce:

  1. Connect to a Spark cluster using Jupyter with the Apache Toree kernel.
  2. Run the following code:

    import ml.dmlc.xgboost4j.scala.spark.XGBoost
    import org.apache.spark.ml.feature.LabeledPoint
    import org.apache.spark.ml.linalg.DenseVector

    val trainRDD = sc.parallelize(Seq(
      LabeledPoint(1.0, new DenseVector(Array(2.0, 3.0, 4.0))),
      LabeledPoint(0.0, new DenseVector(Array(5.0, 5.0, 5.0))),
      LabeledPoint(1.0, new DenseVector(Array(2.0, 3.0, 4.0))),
      LabeledPoint(0.0, new DenseVector(Array(5.0, 5.0, 5.0))),
      LabeledPoint(1.0, new DenseVector(Array(2.0, 3.0, 4.0))),
      LabeledPoint(0.0, new DenseVector(Array(5.0, 5.0, 5.0))),
      LabeledPoint(1.0, new DenseVector(Array(2.0, 3.0, 4.0))),
      LabeledPoint(0.0, new DenseVector(Array(5.0, 5.0, 5.0))),
      LabeledPoint(1.0, new DenseVector(Array(2.0, 3.0, 4.0))),
      LabeledPoint(0.0, new DenseVector(Array(5.0, 5.0, 5.0))),
      LabeledPoint(1.0, new DenseVector(Array(2.0, 3.0, 4.0))),
      LabeledPoint(1.0, new DenseVector(Array(2.0, 3.0, 4.0))),
      LabeledPoint(0.0, new DenseVector(Array(5.0, 5.0, 5.0)))
    ), 4)

    val paramMap = List(
      "eta" -> 0.1f,
      "max_depth" -> 2,
      "objective" -> "binary:logistic").toMap

    val xgboostModelRDD = XGBoost.train(trainRDD, paramMap, 1, 4, useExternalMemory = true)


  3. And I get the following error:

Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.28.2.37, DMLC_TRACKER_PORT=9103, DMLC_NUM_WORKER=4}
lastException = null
Out[7]:
Name: org.apache.spark.SparkDriverExecutionException
Message: Execution error
StackTrace:   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1656)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1899)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1912)
  at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1305)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1279)
  at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1319)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.first(RDD.scala:1318)
  at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBoosters(XGBoost.scala:86)
  at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithRDD(XGBoost.scala:277)
  at ml.dmlc.xgboost4j.scala.spark.XGBoost$.train(XGBoost.scala:205)
  ... 42 elided
Caused by: java.lang.ArrayStoreException: [Lml.dmlc.xgboost4j.LabeledPoint;
  at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:90)
  at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1899)
  at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1899)
  at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:59)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1656)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
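
For context, the `Caused by` line comes from the JVM's runtime array-covariance check. A minimal, self-contained sketch (unrelated to Spark, not from the issue) of how such a store can throw `ArrayStoreException` when an array's real component type differs from what the calling code expects:

```scala
// JVM arrays are covariant: an Array[String] can be viewed as Array[AnyRef],
// but the runtime still remembers the real element type (String).
val strings: Array[String] = Array("a", "b")
val objects: Array[AnyRef] = strings.asInstanceOf[Array[AnyRef]]

// Storing an element whose class does not match the array's real component
// type fails at the moment of the store, analogous to the stack trace above,
// where a result typed as one LabeledPoint class is stored into an array
// allocated for a different LabeledPoint class.
try {
  objects(0) = Integer.valueOf(1) // underneath, this is still a String[]
} catch {
  case e: ArrayStoreException => println(s"ArrayStoreException: ${e.getMessage}")
}
```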

## What have you tried?

1. Running the same code in spark-shell instead of Jupyter, which eliminates the error.
2. Making sure the Spark configuration is exactly the same in Jupyter and in spark-shell.

superbobry commented 6 years ago

This is unlikely to be an XGBoost error, since the code works fine in spark-shell. I'd suggest filing the issue against Apache Toree instead. It could be that the kernel loads two different LabeledPoint classes (one from spark-ml and one from xgboost4j), which would explain the ArrayStoreException.
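
A minimal sketch of the point above (hypothetical stand-in classes, not the real spark-ml or xgboost4j ones): two classes sharing the simple name `LabeledPoint` are still unrelated types on the JVM, and an import rename makes explicit which one each identifier refers to.

```scala
// Two unrelated classes that happen to share the simple name "LabeledPoint",
// standing in for org.apache.spark.ml.feature.LabeledPoint and
// ml.dmlc.xgboost4j.LabeledPoint.
object SparkSide { case class LabeledPoint(label: Double, features: Array[Double]) }
object XgbSide   { case class LabeledPoint(label: Double, features: Array[Double]) }

// An import rename disambiguates the two at the use site.
import SparkSide.{LabeledPoint => MLPoint}
import XgbSide.{LabeledPoint => XgbPoint}

val a = MLPoint(1.0, Array(2.0, 3.0))
val b = XgbPoint(1.0, Array(2.0, 3.0))

// Same simple name, distinct runtime classes: mixing them in one array is
// exactly the kind of mismatch the stack trace reports.
println(a.getClass == b.getClass) // false
```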