amplab / SparkNet

Distributed Neural Networks for Spark
MIT License
604 stars · 171 forks

Question about disabling GPU #52

Closed jaewoosong closed 8 years ago

jaewoosong commented 8 years ago

Good afternoon!

May I ask how to disable the GPU in SparkNet? To be specific, I want to try CifarApp.scala under various conditions. Is setting solver_mode: CPU in cifar10_full_solver.prototxt enough to turn the GPU off in SparkNet, or should I change more settings?

Thanks for your help. I am actively learning deep learning with SparkNet!

robertnishihara commented 8 years ago

I think the easiest way is to replace all of the instances of caffeLib.set_mode_gpu() with caffeLib.set_mode_cpu() in src/main/scala/libs/Net.scala. See #50.
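The suggestion above amounts to swapping the mode call at each site where the native library is configured. A minimal, self-contained sketch of the idea (CaffeLibStub here is a stand-in for SparkNet's real JNA binding to libccaffe; the real fix is a plain find-and-replace in src/main/scala/libs/Net.scala):

```scala
// Simplified sketch: make the Caffe compute mode configurable instead of
// hardcoding caffeLib.set_mode_gpu() at every call site.
// CaffeLibStub stands in for the actual JNA handle to the Caffe library.
object CaffeLibStub {
  var mode: String = "GPU"
  def set_mode_gpu(): Unit = { mode = "GPU" }
  def set_mode_cpu(): Unit = { mode = "CPU" }
}

object ModeConfig {
  // One switch controls the mode; flipping it is equivalent to replacing
  // every set_mode_gpu() call with set_mode_cpu() in Net.scala.
  def applyMode(useGpu: Boolean): String = {
    if (useGpu) CaffeLibStub.set_mode_gpu() else CaffeLibStub.set_mode_cpu()
    CaffeLibStub.mode
  }
}
```

A configurable switch like this avoids having to re-edit the source each time you move between CPU-only and GPU machines.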

Let me know if that works for you!

jaewoosong commented 8 years ago

Thank you! It worked.

So it seems that, regardless of whether "solver_mode: GPU" or "solver_mode: CPU" is written in cifar10_full_solver.prototxt, SparkNet runs with the GPU/CPU setting in the Net.scala file. Is that correct?

SparkNet ran successfully when I used only one node, but it failed when I assigned two or more nodes. Because I am running SparkNet on my Ubuntu desktop, I wondered whether the failure happened because my computer has only one GPU. My assumption was that the different Spark nodes all tried to use the single GPU at the same time.

So I tried CPU-only mode. Sadly, there was an error in CPU-only mode as well: using one node works fine, but using two or more nodes fails, with several different error messages, which I have included at the end of this post. May I ask for your help? I used CifarApp with the default settings.

Thank you so much for releasing your code as open source. It is very helpful!

(1) F0203 13:18:47.811198 9811 split_layer.cpp:21] Check failed: count_ == top[i]->count() (10000 vs. 100)

(2)

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fcf7c40e619, pid=5044, tid=140529144248064
#
# JRE version: Java(TM) SE Runtime Environment (8.0_72-b15) (build 1.8.0_72-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.72-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libccaffe.so+0x376619]  float const& std::max<float>(float const&, float const&)+0x10
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

(3)

[Stage 12:> (0 + 2) / 2] JNA: Callback libs.CaffeNet$$anon$2@777cde68 threw the following exception:
java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:165)
    at libs.MinibatchSampler.nextMinibatch(MinibatchSampler.scala:30)
    at libs.MinibatchSampler.nextLabelMinibatch(MinibatchSampler.scala:50)
    at libs.CaffeNet$$anon$2.invoke(Net.scala:226)
    at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.sun.jna.CallbackReference$DefaultCallbackProxy.invokeCallback(CallbackReference.java:485)
    at com.sun.jna.CallbackReference$DefaultCallbackProxy.callback(CallbackReference.java:515)
    at com.sun.jna.Native.invokeVoid(Native Method)
    at com.sun.jna.Function.invoke(Function.java:374)
    at com.sun.jna.Function.invoke(Function.java:323)
    at com.sun.jna.Library$Handler.invoke(Library.java:236)
    at com.sun.proxy.$Proxy15.solver_test(Unknown Source)
    at libs.CaffeNet.test(Net.scala:111)
    at apps.CifarApp$$anonfun$5.apply(CifarApp.scala:114)
    at apps.CifarApp$$anonfun$5.apply(CifarApp.scala:108)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

JNA: Callback libs.CaffeNet$$anon$1@7c4772ef threw the following exception:
java.lang.ArrayIndexOutOfBoundsException: 50
    at libs.MinibatchSampler.nextMinibatch(MinibatchSampler.scala:27)
    at libs.MinibatchSampler.nextImageMinibatch(MinibatchSampler.scala:38)
    at libs.CaffeNet$$anon$1.invoke(Net.scala:197)
    at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.sun.jna.CallbackReference$DefaultCallbackProxy.invokeCallback(CallbackReference.java:485)
    at com.sun.jna.CallbackReference$DefaultCallbackProxy.callback(CallbackReference.java:515)
    at com.sun.jna.Native.invokeVoid(Native Method)
    at com.sun.jna.Function.invoke(Function.java:374)
    at com.sun.jna.Function.invoke(Function.java:323)
    at com.sun.jna.Library$Handler.invoke(Library.java:236)
    at com.sun.proxy.$Proxy15.solver_test(Unknown Source)
    at libs.CaffeNet.test(Net.scala:111)
    at apps.CifarApp$$anonfun$5.apply(CifarApp.scala:114)
    at apps.CifarApp$$anonfun$5.apply(CifarApp.scala:108)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

(Sorry that I accidentally closed the thread and reopened it.)

robertnishihara commented 8 years ago

Thanks for sharing the error messages! Interesting. We haven't done much experimentation locally, so I wouldn't necessarily expect it to work with more than one node on a single machine at the moment. We'll look into this.

pcmoritz commented 8 years ago

Hey jaewoosong,

If you want to run, say, 2 workers on one node, you have to set

export SPARK_WORKER_INSTANCES=2

in spark/conf/spark-env.sh and then start the master and slaves like this:

./spark/sbin/start-master.sh
./spark/sbin/start-slaves.sh spark://localhost:7077

After that, running the following works for me (you have to change the filepath to make it work on your machine):

./spark-submit --master local --class apps.CifarApp --driver-java-options -Djna.nosys=true /home/pcmoritz/SparkNet/target/scala-2.10/sparknet-assembly-0.1-SNAPSHOT.jar 2
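Putting the steps above together, and assuming the default standalone-master port 7077 and a JDK on the PATH (adjust the paths for your machine), the whole local two-worker setup can be scripted as follows; the jps check at the end is just a sanity test that the worker JVMs actually came up:

```shell
# Assumes SPARK_WORKER_INSTANCES=2 is already set in spark/conf/spark-env.sh
./spark/sbin/start-master.sh
./spark/sbin/start-slaves.sh spark://localhost:7077

# Sanity check: count the running Worker JVMs (should match SPARK_WORKER_INSTANCES)
jps | grep -c Worker
```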

You are right that the GPU mode is set explicitly which overrides the mode from the solver file; we are going to change that in the next version.
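For reference, the solver setting being overridden is the standard Caffe solver_mode field; the excerpt below shows what it looks like in the solver prototxt (note that, per the above, current SparkNet ignores this value because Net.scala sets the mode explicitly):

```
# cifar10_full_solver.prototxt (excerpt)
# Ignored by current SparkNet: Net.scala calls set_mode_gpu()/set_mode_cpu() directly.
solver_mode: CPU
```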

All the best, Philipp.

jaewoosong commented 8 years ago

Thank you, Robert and Philipp!