amplab / SparkNet

Distributed Neural Networks for Spark
MIT License

exception trying to run Mnist example #121

Open dumoulma opened 8 years ago

dumoulma commented 8 years ago


I'm trying to run SparkNet on a MapR cluster running Spark 1.5.2. I can get Caffe to run locally, including the Python bindings, and the SparkNet assembly is using the SPARKNETCPU artifacts (with JavaCPP on the 03-16 version, as indicated in another post).

The job starts up and completes Stage 3 successfully but then throws an exception:

16/04/10 10:18:52 WARN TaskSetManager: Lost task 3.0 in stage 14.0 (TID 41, java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at libs.JavaNDArray.baseFlatInto(
    at libs.JavaNDArray.recursiveFlatInto(
    at libs.JavaNDArray.recursiveFlatInto(
    at libs.JavaNDArray.flatCopy(
    at libs.JavaNDArray.toFlat(
    at libs.NDArray.toFlat(NDArray.scala:32)
    at libs.TensorFlowUtils$.tensorFromNDArray(TensorFlowUtils.scala:71)
    at libs.TensorFlowNet$$anonfun$setWeights$1.apply(TensorFlowNet.scala:114)
    at libs.TensorFlowNet$$anonfun$setWeights$1.apply(TensorFlowNet.scala:112)
    at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
    at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
    at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:102)
    at libs.TensorFlowNet.setWeights(TensorFlowNet.scala:112)
    at apps.MnistApp$$anonfun$main$4.apply$mcVI$sp(MnistApp.scala:96)
    at apps.MnistApp$$anonfun$main$4.apply(MnistApp.scala:96)
    at apps.MnistApp$$anonfun$main$4.apply(MnistApp.scala:96)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:894)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:894)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at
    at org.apache.spark.executor.Executor$
    at java.util.concurrent.ThreadPoolExecutor.runWorker(
    at java.util.concurrent.ThreadPoolExecutor$
    at

Any help would be greatly appreciated.

Note: the Cifar example also fails with what seems to be the exact same error.

robertnishihara commented 8 years ago

Just to be sure, can you tell me what command you're running to launch the Mnist app? Also, did you download the Mnist data with Sparknet/data/mnist/? Similarly, did you download the Cifar data with Sparknet/data/cifar10/?

dumoulma commented 8 years ago

Yes, I used and use the command as shown in the readme: --class apps.CifarApp path/to/Sparknet-jar-with-deps.jar 2

robertnishihara commented 8 years ago

I'd suggest running the individual commands from a Spark shell and seeing specifically where the error occurs. Also, are there any error messages on the workers?

dumoulma commented 8 years ago

The error happens after the data is loaded. The Caffe network config loads and runs for a bit, then it crashes with the ArrayIndexOutOfBoundsException. CaffeOnSpark runs without issues on that EC2 instance (m3.xlarge) running a Spark 1.6 or 1.5 standalone cluster with 2 workers.

robertnishihara commented 8 years ago

Since it's on EC2, if you want to share the image with us, it'd be easy for us to look into it.

It should work fine on the image that we provide (in the readme).

abongLee commented 8 years ago

Did you solve the problem? I have a similar problem: the SparkNet assembly is also using SPARKNETCPU, and it crashes with the same ArrayIndexOutOfBoundsException.

dumoulma2 commented 8 years ago

I have not. I was severely pressed for time, got CaffeOnSpark working, and decided to go with that one. I would still like to get SparkNet working, though.

pcmoritz commented 8 years ago

Hey, thanks for keeping us updated. I think I can reproduce the problem now. It seems to occur in local mode with more than one SparkNet worker; that is not a regime we typically use, so we haven't run into it yet. I'll keep you updated if I find out why the problem occurs.
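
For readers hitting the same trace: the top frames show System.arraycopy being asked to copy outside the bounds of an array during a flat copy. Below is a minimal sketch in plain Java of that general failure mode, assuming (nothing in this thread confirms it) that the flat buffer and the array being flattened disagree on size. `FlatCopySketch`, `flatInto`, and `copyFails` are hypothetical names for illustration only and are not SparkNet's JavaNDArray code.

```java
class FlatCopySketch {
    // Hypothetical helper (not SparkNet's implementation): copies each row
    // of a 2-D array into a flat destination buffer, one row after another.
    static void flatInto(float[][] rows, float[] dest) {
        int offset = 0;
        for (float[] row : rows) {
            // System.arraycopy is specified to throw IndexOutOfBoundsException
            // when offset + row.length exceeds dest.length. The trace in this
            // thread shows its subclass, ArrayIndexOutOfBoundsException.
            System.arraycopy(row, 0, dest, offset, row.length);
            offset += row.length;
        }
    }

    // Returns true if flattening into a buffer of destLength elements fails
    // with an out-of-bounds error, false if the copy succeeds.
    static boolean copyFails(float[][] rows, int destLength) {
        try {
            flatInto(rows, new float[destLength]);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        float[][] weights = {{1f, 2f, 3f}, {4f, 5f, 6f}};
        System.out.println(copyFails(weights, 6)); // buffer matches 2 x 3 elements
        System.out.println(copyFails(weights, 4)); // buffer is too small for the second row
    }
}
```

If the real cause is a size or shape mismatch of this kind, comparing the element count of each weight array against the size of the buffer it is copied into, just before the copy, would be one way to narrow it down from a Spark shell.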