amplab / SparkNet

Distributed Neural Networks for Spark
MIT License

Error while running CAFFE Cifar-10 #106

rahulbhalerao001 closed this issue 8 years ago

rahulbhalerao001 commented 8 years ago

I am sorry for opening another error thread, but I am trying the new AMI d0833da3 and I got the following error (pulled and rebuilt) while running CIFAR-10:

```
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 12.0 failed 1 times, most recent failure: Lost task 3.0 in stage 12.0 (TID 355, 172.31.20.36): ExecutorLostFailure (executor 2 lost)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:890)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:888)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
	at org.apache.spark.rdd.RDD.foreach(RDD.scala:888)
	at apps.CifarApp$.main(CifarApp.scala:82)
	at apps.CifarApp.main(CifarApp.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

In the stdout of the executors I got this:

```
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f9b11147be0, pid=9265, tid=140304540440320
#
# JRE version: OpenJDK Runtime Environment (7.0_95) (build 1.7.0_95-b00)
# Java VM: OpenJDK 64-Bit Server VM (24.95-b01 mixed mode linux-amd64 )
# Derivative: IcedTea 2.6.4
# Distribution: Ubuntu 14.04.3 LTS, package 7u95-2.6.4-0ubuntu0.14.04.1
# Problematic frame:
# C  [libcaffe.so.1.0.0-rc3+0x2e2be0]  caffe::Caffe::RNG::generator()+0x0
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /root/spark/work/app-20160308211614-0002/3/hs_err_pid9265.log
#
# If you would like to submit a bug report, please include instructions on how to reproduce the bug and visit:
#   http://icedtea.classpath.org/bugzilla
```

The dump file is attached in case it is useful: hs_err_pid9541.log.txt

rahulbhalerao001 commented 8 years ago

Cluster creation command

```
SparkNet/ec2/spark-ec2 --key-pair=newtrial \
  --identity-file=newtrial.pem \
  --region=eu-west-1 \
  --zone=eu-west-1c \
  --instance-type=g2.8xlarge \
  --ami=ami-d0833da3 \
  --copy-aws-credentials \
  --spark-version=1.5.0 \
  --no-ganglia \
  --user-data SparkNet/ec2/cloud-config.txt \
  --slaves=5 \
  launch sparknet
```

Job Submit command

```
/root/spark/bin/spark-submit --class apps.CifarApp /root/SparkNet/target/scala-2.10/sparknet-assembly-0.1-SNAPSHOT.jar 5
```

robertnishihara commented 8 years ago

Thanks Rahul,

Can you do a `git pull` and `sbt assembly` to make sure you're using the most up-to-date code? Also, can you run `sbt "test-only CaffeNetSpec"` and tell me if the tests pass or fail?

rahulbhalerao001 commented 8 years ago

I had done a `git pull` and `sbt assembly`, and the build was successful. I believe all the tests were also successful, but I will specifically run only the CaffeNetSpec tests and report the output.

rahulbhalerao001 commented 8 years ago

```
root@ip-172-31-20-165:~/SparkNet# sbt "test-only CaffeNetSpec"
Picked up _JAVA_OPTIONS: -Xmx8g
[info] Loading project definition from /root/SparkNet/project
[info] Set current project to sparknet (in build file:/root/SparkNet/)
[info] CaffeNetSpec:
[info] NetParam
[info] - should be loaded !!! IGNORED !!!
[info] CaffeNet
[info] - should be created !!! IGNORED !!!
[info] CaffeNet
[info] - should call forward !!! IGNORED !!!
[info] CaffeNet
[info] - should call forwardBackward !!! IGNORED !!!
[info] Calling forward
[info] - should leave weights unchanged !!! IGNORED !!!
[info] Calling forwardBackward
[info] - should leave weights unchanged !!! IGNORED !!!
[info] Saving and loading the weights
[info] - should leave the weights unchanged !!! IGNORED !!!
[info] Run completed in 248 milliseconds.
[info] Total number of tests run: 0
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 0, failed 0, canceled 0, ignored 7, pending 0
[info] All tests passed.
[success] Total time: 1 s, completed Mar 9, 2016 12:01:35 AM
```

robertnishihara commented 8 years ago

Right, I added the `@Ignore` line in `src/test/scala/libs/CaffeNetSpec.scala` because there were some issues with running the Caffe tests and TensorFlow tests at the same time. Could you comment out that line and run `sbt "test-only CaffeNetSpec"` again?
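If it helps, the toggle can also be scripted from the shell. This is only a sketch: the `sed` pattern assumes the annotation sits alone at the start of a line, and a stand-in file is used here so the commands run anywhere (in the repo the file is `src/test/scala/libs/CaffeNetSpec.scala`, as mentioned above):

```shell
# Stand-in for src/test/scala/libs/CaffeNetSpec.scala (hypothetical contents).
printf '@Ignore\nclass CaffeNetSpec extends FlatSpec {\n' > CaffeNetSpec.scala

# Comment out the @Ignore annotation so the suite is no longer skipped.
sed -i 's|^@Ignore|// @Ignore|' CaffeNetSpec.scala

head -n 1 CaffeNetSpec.scala   # prints: // @Ignore
```

After the edit, rerunning `sbt "test-only CaffeNetSpec"` would execute the previously ignored tests.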

rahulbhalerao001 commented 8 years ago

Thank you for helping on this. Here are the results:

```
[info] - should leave the weights unchanged
[info] Run completed in 2 seconds, 517 milliseconds.
[info] Total number of tests run: 7
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 7, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 4 s, completed Mar 9, 2016 12:54:58 AM
```

robertnishihara commented 8 years ago

I remember running into that error before when using Caffe with GPUs. Commenting out the line `Caffe.set_mode(Caffe.GPU)` in `CifarApp.scala` should solve the problem (but will also disable GPUs).

We recently rebuilt the jars for Caffe in a way that seemed to get rid of the problem. Is it possible that you're using the old jars? You might try removing `SparkNet/target` and `/root/.ivy2/cache/org.bytedeco*` and recompiling.
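As a sketch of that cleanup (stand-in directories are used here so the commands are runnable anywhere; on the cluster the real paths would be the SparkNet build output and the `/root/.ivy2/cache/org.bytedeco*` cache, followed by a fresh `sbt assembly` to re-resolve the rebuilt jars):

```shell
# Stand-in layout mirroring the paths mentioned above.
mkdir -p demo/SparkNet/target demo/.ivy2/cache/org.bytedeco.javacpp

# Remove the build output and the cached bytedeco artifacts so the next
# `sbt assembly` re-downloads the rebuilt Caffe jars instead of reusing
# stale ones.
rm -rf demo/SparkNet/target demo/.ivy2/cache/org.bytedeco*

ls demo/.ivy2/cache   # the cached bytedeco directory is gone
```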

Also, TensorFlow works well with GPUs, so you might give that a try.

There is a bunch of discussion about this in bytedeco/javacpp-presets#147.

rahulbhalerao001 commented 8 years ago

Hello Robert,

Thank you for your suggestions. After removing the two directories you mentioned, I tried building again, but it failed. However, re-adding the `@Ignore` annotation to the Caffe test file allowed the build to succeed, and the training is now running successfully. So I guess the problem was some stale dependent libraries.

I am observing that Caffe is able to utilize only one GPU on the machine. Are multiple GPUs supported for the wrapped Caffe in SparkNet?

I will be interested in looking into the wrapped TensorFlow's performance too.

robertnishihara commented 8 years ago

We don't have any example apps using multiple GPUs, but it should be possible to get it working (we had it working before the switch to JavaCPP).

rahulbhalerao001 commented 8 years ago

OK, thank you for the information. Closing this thread.