amplab / SparkNet

Distributed Neural Networks for Spark
MIT License

Can I compile SparkNet without CUDA (to run on CPU)? #138

Open prateekarora-git opened 8 years ago

prateekarora-git commented 8 years ago

Hi, I compiled SparkNet successfully with CUDA 7.0, but when I tried to run the "Train Cifar using SparkNet" application, it showed this error:

F0628 17:53:57.325634 29332 cudnn_conv_layer.cpp:52] Check failed: error == cudaSuccess (38 vs. 0) no CUDA-capable device is detected
*** Check failure stack trace: ***

My OS is Ubuntu 14.04 running in a virtual machine, and I don't have GPU support right now. Can I test the application on the CPU without using CUDA?

If possible, please give me the steps to compile and run the application on the CPU.

Regards, Prateek

robertnishihara commented 8 years ago

Hi Prateek, take a look at the instructions in #110.

prateekarora-git commented 8 years ago

Thanks. I compiled SparkNet for a CPU cluster, then tried again to run the "Train Cifar using SparkNet" application. This time I got an error in the native library libcaffe.so.1.0.0 at the "sum at CifarApp.scala" stage.

Log Contents:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x00007f696f1b1221, pid=10808, tid=140090651174656
#
# JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build 1.7.0_67-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libcaffe.so.1.0.0-rc3+0x786221]  sgemm_kernel+0x21
#
# Failed to write core dump. Core dumps have been disabled.
#
# An error report file with more information is saved as:
# /yarn/nm/usercache/ubuntu/appcache/application_1467152377093_0019/container_1467152377093_0019_01_000002/hs_err_pid10808.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#
[thread 140091143587584 also had an error]
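Editor's note: the two most useful facts in a crash header like the one above are the signal and the problematic frame. SIGILL is the signal raised for an illegal instruction, and the frame here sits in a routine named sgemm_kernel inside libcaffe.so (SGEMM is the BLAS single-precision matrix multiply). The following is a minimal sketch for pulling those fields out of a log; the function name `summarize_hs_err` and the regular expressions are illustrative assumptions, not part of SparkNet or the JVM tooling.

```python
import re

def summarize_hs_err(text):
    """Return (signal, library, symbol) found in a HotSpot crash header.

    Any part that cannot be found is returned as None.
    """
    # Example match: "SIGILL (0x4)"
    signal = re.search(r"\b(SIG[A-Z]+)\s+\(0x[0-9a-fA-F]+\)", text)
    # Example match: "[libcaffe.so.1.0.0-rc3+0x786221] sgemm_kernel+0x21"
    frame = re.search(
        r"\[([^\]\s]+?)\+0x[0-9a-fA-F]+\]\s+(\S+?)\+0x[0-9a-fA-F]+", text
    )
    return (
        signal.group(1) if signal else None,
        frame.group(1) if frame else None,
        frame.group(2) if frame else None,
    )

# Lines copied from the log posted in this thread.
sample = (
    "SIGILL (0x4) at pc=0x00007f696f1b1221, pid=10808, tid=140090651174656\n"
    "Problematic frame:\n"
    "C [libcaffe.so.1.0.0-rc3+0x786221] sgemm_kernel+0x21\n"
)
print(summarize_hs_err(sample))
# -> ('SIGILL', 'libcaffe.so.1.0.0-rc3', 'sgemm_kernel')
```

The patterns ignore a leading "#", so the same function should work on the full hs_err_pid file as well as on a pasted excerpt.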

pcmoritz commented 8 years ago

Did you create the JARs yourself or did you use the JARs we provide? There seems to be an error in the BLAS library. If you compiled it yourself, which one are you using?

prateekarora-git commented 8 years ago

Yes, I built my own JAR file. I checked out the SparkNet code with git clone https://github.com/amplab/SparkNet.git

Then I modified build.sbt (changed the URL to snapshot-2016-03-16-CPU and changed all instances of SPARKNET to SPARKNETCPU).
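Editor's note: the rename step described above can be scripted so that it is safe to run more than once. This is only a sketch under stated assumptions: the function name `rename_sparknet_tokens` is hypothetical, the sample line is made up (the real build.sbt is not shown in this thread), and the URL change is left as a manual edit because the thread does not give the full URL.

```python
import re

def rename_sparknet_tokens(build_text):
    """Replace each upper-case SPARKNET that is not already SPARKNETCPU.

    The match is case-sensitive, so "SparkNet" and "sparknet" are left alone.
    """
    return re.sub(r"SPARKNET(?!CPU)", "SPARKNETCPU", build_text)

# Made-up input for illustration only; not a line from the real build.sbt.
sample = 'placeholder = "SPARKNET"  // SparkNet stays untouched'
once = rename_sparknet_tokens(sample)
print(once)
print(rename_sparknet_tokens(once) == once)  # running it again changes nothing
```

The negative lookahead is the design choice that matters here: a plain search-and-replace run twice would produce SPARKNETCPUCPU.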

I have my own Spark 1.6 cluster running on Cloudera 5.7.0.

Then I tried to run the application using:

spark-submit --master yarn-cluster --num-executors 3 --driver-memory 4G --executor-memory 4G --conf spark.akka.frameSize=300 --class apps.CifarApp target/scala-2.10/sparknet-assembly-0.1-SNAPSHOT.jar 3

pcmoritz commented 8 years ago

Oh, I see. When I said "create the JAR yourself", I meant to ask whether you followed this procedure: https://github.com/amplab/SparkNet/blob/master/doc/creating-jars.md

If you are running on Cloudera with Ubuntu 14.04 (if that is possible), the procedure you described should work out of the box. If your cluster uses a different distribution, you might have to follow the above procedure to make sure it works.

prateekarora-git commented 8 years ago

Hi, thanks for the information. Yes, I am using Cloudera 5.7.0 with Ubuntu 14.04. The Cloudera distribution includes the Spark 1.6.0 JAR files, and I run Spark on a YARN cluster. I have tested many Spark applications on my cluster.

So, as I understand it, the procedure I used to compile SparkNet and run "Train Cifar using SparkNet" should work.

Is there any hint for solving my problem?

Regards, Prateek

prateekarora-git commented 8 years ago

One more thing: I am using java version "1.7.0_101".

pcmoritz commented 8 years ago

On EC2 it works on Ubuntu 14.04. Did you start from a fresh image, or might another version of BLAS be causing problems? I'm happy to have a quick look at the log hs_err_pid10808.log to see if it contains more information, if you are willing to share it.

prateekarora-git commented 8 years ago

Attached the log file: hs_err_pid7119.docx

prateekarora-git commented 8 years ago

Hi, I tried with a fresh image and it works. But I want to run this on my existing cluster, where the issue occurs. I can't move all my previous work to a new cluster.

Regards, Prateek

pcmoritz commented 8 years ago

Are the software versions the same on both the fresh image and your existing cluster? Do you have any other BLAS libraries installed on your existing cluster? My guess right now is that a different BLAS is loaded at runtime. Unfortunately, the log is not very helpful.
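Editor's note: one way to test the "different BLAS loaded at runtime" guess on Linux is to look at /proc/<pid>/maps for the running executor JVM, which lists every file mapped into that process. The sketch below filters that listing for BLAS-looking shared objects. The function names and the list of name fragments in `BLAS_HINTS` are assumptions made for illustration, and the sample excerpt is made up.

```python
import os

# Assumed name fragments for common BLAS builds; not an exhaustive list.
BLAS_HINTS = ("blas", "atlas", "lapack", "mkl")

def blas_libraries(maps_text):
    """Return sorted unique mapped file paths whose file name looks BLAS-related."""
    found = set()
    for line in maps_text.splitlines():
        # Layout: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5].strip()
        name = os.path.basename(path).lower()
        if path.startswith("/") and any(hint in name for hint in BLAS_HINTS):
            found.add(path)
    return sorted(found)

def blas_libraries_of(pid):
    """Read /proc/<pid>/maps (Linux only) and report BLAS-looking libraries."""
    with open("/proc/%d/maps" % pid) as handle:
        return blas_libraries(handle.read())

# Made-up excerpt in the /proc/<pid>/maps layout, for illustration only.
sample = """\
7f69a0000000-7f69a0021000 rw-p 00000000 00:00 0
7f696ea2b000-7f696f9d1000 r-xp 00000000 08:01 1311 /tmp/native/libcaffe.so.1.0.0-rc3
7f696d000000-7f696d400000 r-xp 00000000 08:01 2207 /usr/lib/libopenblas.so.0
7f696d400000-7f696d600000 rw-p 00400000 08:01 2207 /usr/lib/libopenblas.so.0
"""
print(blas_libraries(sample))
# -> ['/usr/lib/libopenblas.so.0']
```

Running `blas_libraries_of(pid)` for an executor process on both the fresh image and the existing cluster, then comparing the two lists, would show whether the clusters load different BLAS libraries. It needs permission to read that process's maps file, and an empty result on both does not rule out a BLAS that is linked statically into another library.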