amplab / SparkNet

Distributed Neural Networks for Spark
MIT License

Error : Check failed: error == cudaSuccess (30 vs. 0) unknown error #120

Open rahulbhalerao001 opened 8 years ago

rahulbhalerao001 commented 8 years ago

I created a private AMI from a working build of the code (after the cache changes), and the ImageNet example ran correctly on this AMI.

However, today I created a new cluster from this AMI and got the error "Check failed: error == cudaSuccess (30 vs. 0) unknown error".

robertnishihara commented 8 years ago

Did you get the error when running ImageNetApp.scala? Was that the only difference? What sort of nodes were you using, and what OS? Were you using the current SparkNet master, or did you modify anything?

rahulbhalerao001 commented 8 years ago

Yes, ImageNetApp.scala. After posting, I pulled, rebuilt, and reran, with the same result. The nodes are g2.8xlarge with Ubuntu 14.04 (same as the AMI provided here). No, I did not modify anything.

rahulbhalerao001 commented 8 years ago

I made a private AMI, because I wanted to stick to one version of the code.

robertnishihara commented 8 years ago

I'm not sure exactly what the problem is, but one starting point is to figure out exactly where the error is occurring. A couple of ways to do this:

  1. Start a Spark shell with something like ~/spark/bin/spark-shell --jars /root/SparkNet/target/scala-2.10/sparknet-assembly-0.1-SNAPSHOT.jar, then try loading a model from a model file, creating a net, and calling net.forward to see precisely where it crashes. You can do all of this on the Spark master; you don't need to create a net on each worker.
  2. Sometimes running things in the Spark shell is different from running a script, so I'd suggest commenting out components of CifarApp.scala until you stop getting the error to find the minimal example that causes it to fail.
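
Before either step, it may be worth ruling out the GPU setup itself, since the failing check is a CUDA call rather than Spark code. The following is only a rough sketch and not part of SparkNet: it assumes the NVIDIA driver exposes device nodes named /dev/nvidia* and that nvidia-smi is installed alongside the driver. The function names count_gpu_nodes and report_gpu_state are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: check whether this node can see its GPU before starting the Spark shell.
# Assumption: the NVIDIA driver creates device nodes named /dev/nvidia*.

# Print the number of nvidia* entries under a directory (default: /dev).
count_gpu_nodes() {
  local dev_dir="${1:-/dev}"
  local n=0
  local f
  for f in "$dev_dir"/nvidia*; do
    # An unmatched glob stays literal, so test that the path really exists.
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Print a short report; return non-zero when no device nodes are found.
report_gpu_state() {
  local dev_dir="${1:-/dev}"
  local n
  n="$(count_gpu_nodes "$dev_dir")"
  if [ "$n" -eq 0 ]; then
    echo "no NVIDIA device nodes under $dev_dir"
    return 1
  fi
  echo "found $n NVIDIA device node(s) under $dev_dir"
  # nvidia-smi talks to the driver directly; skip it if it is not installed.
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || echo "nvidia-smi failed; the driver may not be loaded"
  fi
  return 0
}
```

Sourcing this file and running report_gpu_state on the master (and on a worker) would show whether the devices are visible at all. If they are not, the problem is likely outside SparkNet; if they are, the two steps above are the way to narrow it down.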

By the way, you said the problem was with ImageNetApp, but does it also occur with CifarApp?