dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[jvm-packages] Rabit Initialization Problem? #2148

Closed. lserlohn closed this issue 7 years ago.

lserlohn commented 7 years ago

Hello, I am trying to launch a very simple Spark program with XGBoost.

The code is almost the same as the example. The dataset is built in the code as:

feature               label
1,2,3,4,...10        11
2,3,4,5,...11        12
3,4,5,6,...12        13

The code compiles without problems.

Then I submitted the job, along with the two xgboost4j jars with all the dependencies:

spark-submit --driver-memory 4g --conf 'spark.mesos.coarse=true' --jars xgboost4j-0.7-jar-with-dependencies.jar,xgboost4j-spark-0.7-jar-with-dependencies.jar xgboostspark_2.11-1.0.jar

But I got the errors below.

I would like to know: is submitting the xgboost4j jars-with-dependencies good enough?

Do I need to install Rabit/dmlc-core or anything else on each individual node in the cluster? So far, nothing has been installed on any node; I was hoping the jars-with-dependencies already include them...

  import ml.dmlc.xgboost4j.scala.Booster
  import ml.dmlc.xgboost4j.scala.spark.XGBoost
  import org.apache.spark.SparkConf
  import org.apache.spark.ml.feature.VectorAssembler
  import org.apache.spark.sql.{Row, SparkSession}
  import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

  def main(args: Array[String]): Unit = {

    // create SparkSession
    val sparkConf = new SparkConf().setAppName("XGBoost-spark-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    sparkConf.registerKryoClasses(Array(classOf[Booster]))
    val ss = SparkSession.builder().appName("XGBoost-spark-example")
                                   .config(sparkConf)
                                   .enableHiveSupport()
                                   .getOrCreate()

    // create training and testing dataframes
    val schemaString = "value1 value2 value3 value4 value5 " +
      "value6 value7 value8 value9 value10 label"
    val numCol = 11
    val arraydataInt = (1 to 10000).toArray
    val arraydata = arraydataInt.map(x => x.toDouble)
    val slideddata = arraydata.sliding(numCol).toSeq
    val rows = arraydata.sliding(numCol).map { x => Row(x: _*) }
    val datasetsize = arraydataInt.size

    val myrdd = ss.sparkContext.makeRDD(rows.toSeq, 128).persist()

    val schema =
      StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, DoubleType, true)))

    val df = ss.createDataFrame(myrdd, schema).cache()

    val numRound = 100
    import ml.dmlc.xgboost4j.scala.spark.DataUtils._

    val assembler = new VectorAssembler()
      .setInputCols(Array("value1","value2","value3","value4","value5",
        "value6","value7","value8","value9","value10"))
      .setOutputCol("features")

    val trainset = assembler.transform(df).select("label", "features").persist()

    trainset.show()

    // training parameters
    val paramMap = List(
      "eta" -> 0.1f,
      "max_depth" -> 6,
      "objective" -> "reg:linear").toMap
    val xgboostModel = XGBoost.trainWithDataFrame(
      trainset, paramMap, numRound, nWorkers = 128, useExternalMemory = false)
    // xgboost-spark appends the column containing prediction results
    xgboostModel.transform(trainset).show()
  }

Error Stack:

Tracker started, with env={DMLC_TRACKER_URI=172.30.43.39, DMLC_NUM_WORKER=128, DMLC_TRACKER_PORT=9091, DMLC_NUM_SERVER=0}
[Stage 1:>                                                       (0 + 64) / 128]17/03/24 14:13:53 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, mapr326.sv.walmartlabs.com, executor 2): java.lang.UnsatisfiedLinkError: ml.dmlc.xgboost4j.java.XGBoostJNI.RabitInit([Ljava/lang/String;)I
    at ml.dmlc.xgboost4j.java.XGBoostJNI.RabitInit(Native Method)
    at ml.dmlc.xgboost4j.java.Rabit.init(Rabit.java:81)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$buildDistributedBoosters$1.apply(XGBoost.scala:113)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$buildDistributedBoosters$1.apply(XGBoost.scala:111)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

17/03/24 14:13:54 ERROR TaskSetManager: Task 0 in stage 1.0 failed 4 times; aborting job
17/03/24 14:13:54 ERROR RabitTracker: Uncaught exception thrown by worker:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 82, mapr326.sv.walmartlabs.com, executor 2): java.lang.UnsatisfiedLinkError: ml.dmlc.xgboost4j.java.XGBoostJNI.RabitInit([Ljava/lang/String;)I
    at ml.dmlc.xgboost4j.java.XGBoostJNI.RabitInit(Native Method)
    at ml.dmlc.xgboost4j.java.Rabit.init(Rabit.java:81)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$buildDistributedBoosters$1.apply(XGBoost.scala:113)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$buildDistributedBoosters$1.apply(XGBoost.scala:111)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:925)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anon$2.run(XGBoost.scala:295)
Caused by: java.lang.UnsatisfiedLinkError: ml.dmlc.xgboost4j.java.XGBoostJNI.RabitInit([Ljava/lang/String;)I
    at ml.dmlc.xgboost4j.java.XGBoostJNI.RabitInit(Native Method)
    at ml.dmlc.xgboost4j.java.Rabit.init(Rabit.java:81)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$buildDistributedBoosters$1.apply(XGBoost.scala:113)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$buildDistributedBoosters$1.apply(XGBoost.scala:111)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
CodingCat commented 7 years ago

Are you sure you compiled the native library correctly? Can you check all the logs from before mvn package finished?
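
A minimal local sanity check (just a sketch on my part, using only the Rabit calls that already appear in the stack trace above): trigger the same JNI binding outside Spark. If the native library was not built into the jar, the first call fails with the identical UnsatisfiedLinkError:

    import ml.dmlc.xgboost4j.java.Rabit
    import scala.collection.JavaConverters._

    object RabitLinkCheck {
      def main(args: Array[String]): Unit = {
        // Binding the native RabitInit symbol happens here; a missing or
        // wrong-platform libxgboost4j fails with UnsatisfiedLinkError.
        Rabit.init(Map("DMLC_TASK_ID" -> "0").asJava)
        Rabit.shutdown()
        println("native library loaded and Rabit initialized")
      }
    }

Whether init returns cleanly without a real tracker is a separate question; the point is that a broken native build fails at the very first call.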

RhariharanExpedia commented 7 years ago

I had a similar issue that (perhaps coincidentally) went away when I compiled XGBoost on the same OS as the machine we were running it on.

I had previously compiled with my Mac, but found that the resulting jar did not work on the AWS EC2 instance we run Spark on.
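
For what it's worth, a quick way to confirm a platform mismatch (a sketch, assuming the SparkSession ss from the code above; the underlying fact is that the JNI library is platform-specific, so a jar built on macOS bundles a .dylib that Linux executors cannot load):

    // Print the OS/arch the executors actually run on; compare it with
    // the platform the native library was compiled on.
    ss.sparkContext.parallelize(1 to 64, 8)
      .map(_ => s"${sys.props("os.name")} / ${sys.props("os.arch")}")
      .distinct()
      .collect()
      .foreach(println)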

lserlohn commented 7 years ago

Hi all, thanks for the help. It seems that something errored during the "test use nested groupData" test:

- test use nested groupData
RabitTrackerRobustnessSuite:
Tracker started, with env={DMLC_TRACKER_URI=172.30.43.39, DMLC_NUM_WORKER=8, DMLC_TRACKER_PORT=9091, DMLC_NUM_SERVER=0}

[Stage 0:>                                                          (0 + 8) / 8]17/03/27 15:02:31 ERROR Executor: Exception in task 7.0 in stage 0.0 (TID 7)
java.lang.RuntimeException: Worker exception.
    at ml.dmlc.xgboost4j.scala.spark.RabitTrackerRobustnessSuite$$anonfun$1$$anonfun$4.apply(RabitTrackerRobustnessSuite.scala:77)
    at ml.dmlc.xgboost4j.scala.spark.RabitTrackerRobustnessSuite$$anonfun$1$$anonfun$4.apply(RabitTrackerRobustnessSuite.scala:71)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
17/03/27 15:02:31 ERROR TaskSetManager: Task 7 in stage 0.0 failed 1 times; aborting job
17/03/27 15:02:31 ERROR RabitTracker: Uncaught exception thrown by worker:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 (TID 7, localhost, executor driver): java.lang.RuntimeException: Worker exception.
    at ml.dmlc.xgboost4j.scala.spark.RabitTrackerRobustnessSuite$$anonfun$1$$anonfun$4.apply(RabitTrackerRobustnessSuite.scala:77)
    at ml.dmlc.xgboost4j.scala.spark.RabitTrackerRobustnessSuite$$anonfun$1$$anonfun$4.apply(RabitTrackerRobustnessSuite.scala:71)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:925)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
    at ml.dmlc.xgboost4j.scala.spark.RabitTrackerRobustnessSuite$$anonfun$1$$anon$1.run(RabitTrackerRobustnessSuite.scala:86)
Caused by: java.lang.RuntimeException: Worker exception.
    at ml.dmlc.xgboost4j.scala.spark.RabitTrackerRobustnessSuite$$anonfun$1$$anonfun$4.apply(RabitTrackerRobustnessSuite.scala:77)
    at ml.dmlc.xgboost4j.scala.spark.RabitTrackerRobustnessSuite$$anonfun$1$$anonfun$4.apply(RabitTrackerRobustnessSuite.scala:71)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
lserlohn commented 7 years ago

BTW, is it OK to simply submit the two jar files, without deploying anything on the cluster nodes?

spark-submit --driver-memory 4g --conf 'spark.mesos.coarse=true' --jars xgboost4j-0.7-jar-with-dependencies.jar,xgboost4j-spark-0.7-jar-with-dependencies.jar xgboostspark_2.11-1.0.jar

CodingCat commented 7 years ago

Is "test use nested groupData" constantly failing?

BTW, is it OK to simply submit the two jar files, without deploying anything on the cluster nodes? spark-submit --driver-memory 4g --conf 'spark.mesos.coarse=true' --jars xgboost4j-0.7-jar-with-dependencies.jar,xgboost4j-spark-0.7-jar-with-dependencies.jar xgboostspark_2.11-1.0.jar

As long as you get the .so files packaged correctly, it should be fine.
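
One way to verify that (a sketch; the lib/libxgboost4j.so resource path is my assumption about how the 0.7 fat jar lays out the native library, adjust if your jar differs):

    // Look for the bundled native library on the classpath; run this both
    // on the driver and inside an executor to confirm the fat jar shipped it.
    val soUrl = getClass.getClassLoader.getResource("lib/libxgboost4j.so")
    println(Option(soUrl).map(u => s"bundled native lib: $u")
      .getOrElse("libxgboost4j.so NOT found on classpath"))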

lserlohn commented 7 years ago

Yes, "test use nested groupData" fails constantly. I tried 3 times, and it failed every time. What might be the cause?

Meanwhile, for the above program, I found that if the training size is too big, say val arraydataInt = (1 to 10000).toArray, the program gets stuck at the training stage, XGBoost.trainWithDataFrame(......).

However, if I change the training size to val arraydataInt = (1 to 1000).toArray, the program passes through and returns values.

What might be the cause?

CodingCat commented 7 years ago

I just looked at your log. Why do you think "test use nested groupData" failed? The ERROR you see in the log is thrown intentionally by another test case.

Can you post the last few lines of your log?

e.g.

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] xgboost-jvm ........................................ SUCCESS [  3.293 s]
[INFO] xgboost4j .......................................... SUCCESS [ 12.842 s]
[INFO] xgboost4j-spark .................................... SUCCESS [ 58.721 s]
[INFO] xgboost4j-flink .................................... SUCCESS [  3.217 s]
[INFO] xgboost4j-example .................................. SUCCESS [  2.665 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:20 min
[INFO] Finished at: 2017-03-28T19:45:53-07:00
[INFO] Final Memory: 33M/489M
[INFO] ------------------------------------------------------------------------
CodingCat commented 7 years ago

What do you mean by stuck? Did your task run and finish, or did nothing happen?

lserlohn commented 7 years ago

This is the INFO log from after the training set is prepared.

trainset.show()

// training parameters
val paramMap = List(
  "eta" -> 0.1f,
  "max_depth" -> 6,
  "objective" -> "reg:linear").toMap
val xgboostModel = XGBoost.trainWithDataFrame(
  trainset, paramMap, numRound, nWorkers = 64, useExternalMemory = false)

It hangs at the end; no more messages come out.

From what I can tell from my tests, the larger "numRound" is, the higher the probability that it hangs during training...

I am sure I have 64 CPUs, across 8 executors.

Because there is no message printout, I cannot see where it is blocked. How can I see the training information within XGBoost4j?

BTW, I can see that additional INFO log lines are printed every 7 minutes; does that mean the training is still running? Where can I see the training progress?

17/03/29 14:27:21 INFO DAGScheduler: ShuffleMapStage 5 (repartition at XGBoost.scala:92) finished in 2.939 s
17/03/29 14:27:21 INFO DAGScheduler: looking for newly runnable stages
17/03/29 14:27:21 INFO DAGScheduler: running: Set()
17/03/29 14:27:21 INFO DAGScheduler: waiting: Set(ResultStage 6)
17/03/29 14:27:21 INFO DAGScheduler: failed: Set()
17/03/29 14:27:21 INFO DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[22] at mapPartitions at XGBoost.scala:110), which has no missing parents
17/03/29 14:27:21 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 6.1 KB, free 2.9 GB)
17/03/29 14:27:21 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 3.8 KB, free 2.9 GB)
17/03/29 14:27:21 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 172.30.43.39:47225 (size: 3.8 KB, free: 2.9 GB)
17/03/29 14:27:21 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:996
17/03/29 14:27:21 INFO DAGScheduler: Submitting 64 missing tasks from ResultStage 6 (MapPartitionsRDD[22] at mapPartitions at XGBoost.scala:110)
17/03/29 14:27:21 INFO TaskSchedulerImpl: Adding task set 6.0 with 64 tasks
17/03/29 14:27:21 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 385, sv1-hp0213-15.myserver.com, executor 5, partition 0, NODE_LOCAL, 6293 bytes)
17/03/29 14:27:21 INFO TaskSetManager: Starting task 1.0 in stage 6.0 (TID 386, sv1-hp0210-11.myserver.com, executor 3, partition 1, NODE_LOCAL, 6293 bytes)
17/03/29 14:27:21 INFO TaskSetManager: Starting task 2.0 in stage 6.0 (TID 387, sv1-hp0213-15.myserver.com, executor 5, partition 2, NODE_LOCAL, 6293 bytes)
17/03/29 14:27:21 INFO TaskSetManager: Starting task 3.0 in stage 6.0 (TID 388, sv1-hp0210-11.myserver.com, executor 3, partition 3, NODE_LOCAL, 6293 bytes)
17/03/29 14:27:21 INFO TaskSetManager: Starting task 4.0 in stage 6.0 (TID 389, sv1-hp0213-15.myserver.com, executor 5, partition 4, NODE_LOCAL, 6293 bytes)
17/03/29 14:27:21 INFO TaskSetManager: Starting task 5.0 in stage 6.0 (TID 390, sv1-hp0210-11.myserver.com, executor 3, partition 5, NODE_LOCAL, 6293 bytes)
17/03/29 14:27:21 INFO TaskSetManager: Starting task 6.0 in stage 6.0 (TID 391, sv1-hp0213-15.myserver.com, executor 5, partition 6, NODE_LOCAL, 6293 bytes)
17/03/29 14:27:21 INFO TaskSetManager: Starting task 7.0 in stage 6.0 (TID 392, sv1-hp0210-11.myserver.com, executor 3, partition 7, NODE_LOCAL, 6293 bytes)
17/03/29 14:27:21 INFO TaskSetManager: Starting task 8.0 in stage 6.0 (TID 393, sv1-hp0213-15.myserver.com, executor 5, partition 8, NODE_LOCAL, 6293 bytes)
17/03/29 14:27:21 INFO TaskSetManager: Starting task 9.0 in stage 6.0 (TID 394, sv1-hp0210-11.myserver.com, executor 3, partition 9, NODE_LOCAL, 6293 bytes)
17/03/29 14:27:21 INFO TaskSetManager: Starting task 10.0 in stage 6.0 (TID 395, sv1-hp0213-15.myserver.com, executor 5, partition 10, NODE_LOCAL, 6293 bytes)
17/03/29 14:27:21 INFO TaskSetManager: Starting task 11.0 in stage 6.0 (TID 396, sv1-hp0210-11.myserver.com, executor 3, partition 11, NODE_LOCAL, 6293 bytes)
17/03/29 14:27:21 INFO TaskSetManager: Starting task 12.0 in stage 6.0 (TID 397, sv1-hp0213-15.myserver.com, executor 5, partition 12, NODE_LOCAL, 6293 bytes)
17/03/29 14:27:21 INFO TaskSetManager: Starting task 13.0 in stage 6.0 (TID 398, sv1-hp0210-11.myserver.com, executor 3, partition 13, NODE_LOCAL, 6293 bytes)
17/03/29 14:27:21 INFO TaskSetManager: Starting task 14.0 in stage 6.0 (TID 399, sv1-hp0213-15.myserver.com, executor 5, partition 14, NODE_LOCAL, 6293 bytes)
17/03/29 14:27:21 INFO TaskSetManager: Starting task 15.0 in stage 6.0 (TID 400, sv1-hp0210-11.myserver.com, executor 3, partition 15, NODE_LOCAL, 6293 bytes)
17/03/29 14:27:21 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on sv1-hp0210-11.myserver.com:40869 (size: 3.8 KB, free: 2.9 GB)
17/03/29 14:27:21 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on sv1-hp0213-15.myserver.com:58018 (size: 3.8 KB, free: 2.9 GB)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 16.0 in stage 6.0 (TID 401, mapr276.myserver.com, executor 6, partition 16, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 17.0 in stage 6.0 (TID 402, mapr358.myserver.com, executor 0, partition 17, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 18.0 in stage 6.0 (TID 403, sv1-hp0213-22.myserver.com, executor 4, partition 18, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 19.0 in stage 6.0 (TID 404, sv1-hp0206-06.myserver.com, executor 2, partition 19, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 20.0 in stage 6.0 (TID 405, mapr323.myserver.com, executor 7, partition 20, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 21.0 in stage 6.0 (TID 406, sv1-hp0211-04.myserver.com, executor 1, partition 21, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 22.0 in stage 6.0 (TID 407, mapr276.myserver.com, executor 6, partition 22, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 23.0 in stage 6.0 (TID 408, mapr358.myserver.com, executor 0, partition 23, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 24.0 in stage 6.0 (TID 409, sv1-hp0213-22.myserver.com, executor 4, partition 24, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 25.0 in stage 6.0 (TID 410, sv1-hp0206-06.myserver.com, executor 2, partition 25, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 26.0 in stage 6.0 (TID 411, mapr323.myserver.com, executor 7, partition 26, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 27.0 in stage 6.0 (TID 412, sv1-hp0211-04.myserver.com, executor 1, partition 27, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 28.0 in stage 6.0 (TID 413, mapr276.myserver.com, executor 6, partition 28, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 29.0 in stage 6.0 (TID 414, mapr358.myserver.com, executor 0, partition 29, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 30.0 in stage 6.0 (TID 415, sv1-hp0213-22.myserver.com, executor 4, partition 30, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 31.0 in stage 6.0 (TID 416, sv1-hp0206-06.myserver.com, executor 2, partition 31, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 32.0 in stage 6.0 (TID 417, mapr323.myserver.com, executor 7, partition 32, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 33.0 in stage 6.0 (TID 418, sv1-hp0211-04.myserver.com, executor 1, partition 33, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 34.0 in stage 6.0 (TID 419, mapr276.myserver.com, executor 6, partition 34, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 35.0 in stage 6.0 (TID 420, mapr358.myserver.com, executor 0, partition 35, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 36.0 in stage 6.0 (TID 421, sv1-hp0213-22.myserver.com, executor 4, partition 36, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 37.0 in stage 6.0 (TID 422, sv1-hp0206-06.myserver.com, executor 2, partition 37, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 38.0 in stage 6.0 (TID 423, mapr323.myserver.com, executor 7, partition 38, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 39.0 in stage 6.0 (TID 424, sv1-hp0211-04.myserver.com, executor 1, partition 39, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 40.0 in stage 6.0 (TID 425, mapr276.myserver.com, executor 6, partition 40, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 41.0 in stage 6.0 (TID 426, mapr358.myserver.com, executor 0, partition 41, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 42.0 in stage 6.0 (TID 427, sv1-hp0213-22.myserver.com, executor 4, partition 42, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 43.0 in stage 6.0 (TID 428, sv1-hp0206-06.myserver.com, executor 2, partition 43, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 44.0 in stage 6.0 (TID 429, mapr323.myserver.com, executor 7, partition 44, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 45.0 in stage 6.0 (TID 430, sv1-hp0211-04.myserver.com, executor 1, partition 45, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 46.0 in stage 6.0 (TID 431, mapr276.myserver.com, executor 6, partition 46, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 47.0 in stage 6.0 (TID 432, mapr358.myserver.com, executor 0, partition 47, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 48.0 in stage 6.0 (TID 433, sv1-hp0213-22.myserver.com, executor 4, partition 48, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 49.0 in stage 6.0 (TID 434, sv1-hp0206-06.myserver.com, executor 2, partition 49, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 50.0 in stage 6.0 (TID 435, mapr323.myserver.com, executor 7, partition 50, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 51.0 in stage 6.0 (TID 436, sv1-hp0211-04.myserver.com, executor 1, partition 51, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 52.0 in stage 6.0 (TID 437, mapr276.myserver.com, executor 6, partition 52, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 53.0 in stage 6.0 (TID 438, mapr358.myserver.com, executor 0, partition 53, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 54.0 in stage 6.0 (TID 439, sv1-hp0213-22.myserver.com, executor 4, partition 54, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 55.0 in stage 6.0 (TID 440, sv1-hp0206-06.myserver.com, executor 2, partition 55, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 56.0 in stage 6.0 (TID 441, mapr323.myserver.com, executor 7, partition 56, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 57.0 in stage 6.0 (TID 442, sv1-hp0211-04.myserver.com, executor 1, partition 57, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 58.0 in stage 6.0 (TID 443, mapr276.myserver.com, executor 6, partition 58, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 59.0 in stage 6.0 (TID 444, mapr358.myserver.com, executor 0, partition 59, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 60.0 in stage 6.0 (TID 445, sv1-hp0213-22.myserver.com, executor 4, partition 60, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 61.0 in stage 6.0 (TID 446, sv1-hp0206-06.myserver.com, executor 2, partition 61, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 62.0 in stage 6.0 (TID 447, mapr323.myserver.com, executor 7, partition 62, ANY, 6293 bytes)
17/03/29 14:27:24 INFO TaskSetManager: Starting task 63.0 in stage 6.0 (TID 448, sv1-hp0211-04.myserver.com, executor 1, partition 63, ANY, 6293 bytes)
17/03/29 14:27:24 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on sv1-hp0213-22.myserver.com:37057 (size: 3.8 KB, free: 2.9 GB)
17/03/29 14:27:24 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on mapr358.myserver.com:46792 (size: 3.8 KB, free: 2.9 GB)
17/03/29 14:27:24 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on sv1-hp0211-04.myserver.com:37767 (size: 3.8 KB, free: 2.9 GB)
17/03/29 14:27:24 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on mapr276.myserver.com:37461 (size: 3.8 KB, free: 2.9 GB)
17/03/29 14:27:24 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on sv1-hp0206-06.myserver.com:34848 (size: 3.8 KB, free: 2.9 GB)
17/03/29 14:27:24 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on mapr323.myserver.com:38076 (size: 3.8 KB, free: 2.9 GB)
17/03/29 14:27:25 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to 10.93.128.16:45222
17/03/29 14:27:25 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 1107 bytes
17/03/29 14:27:25 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to 10.93.128.65:34876
17/03/29 14:27:26 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to 10.93.131.15:51192
17/03/29 14:27:27 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to 172.30.32.183:52400
17/03/29 14:27:27 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to 172.30.33.12:48351
17/03/29 14:27:28 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to 172.30.33.164:37439
17/03/29 14:27:29 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to 172.30.33.171:47850
17/03/29 14:27:29 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to 172.30.33.69:38171
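
One thing I plan to try next, based only on my reading of the docs (not confirmed anywhere in this thread): all Rabit workers apparently have to be running at the same time for the AllReduce barrier to complete, so nWorkers should not exceed the number of tasks Spark can actually schedule concurrently. A sketch:

    // Never ask for more workers than Spark can run at once; otherwise the
    // workers that did start block forever waiting for the rest to join.
    val maxConcurrent = ss.sparkContext.defaultParallelism
    val xgboostModel = XGBoost.trainWithDataFrame(
      trainset, paramMap, numRound,
      nWorkers = math.min(64, maxConcurrent),
      useExternalMemory = false)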

CodingCat commented 7 years ago

Can you check your executor logs (from the Spark UI) to see if there is anything printed from xgboost (something like unreadable strings... :-( )?

lserlohn commented 7 years ago

Could you please provide more information about the executor log? I can see the thread dump, but no other information under the Executors tab on the Spark UI.

CodingCat commented 7 years ago

You should be able to check it either from the file system of the worker nodes directly,

or through the stderr of the executors from the history server (IIRC).

CodingCat commented 7 years ago

Closing due to no update.