amplab / SparkNet

Distributed Neural Networks for Spark
MIT License

error while running CifarApp #130

Open prakhar21 opened 8 years ago

prakhar21 commented 8 years ago

When I am running the CifarApp on a Spark cluster, the following error comes up:

16/06/08 12:50:04 INFO DAGScheduler: ResultStage 14 (foreach at CifarApp.scala:105) failed in 0.040 s
16/06/08 12:50:04 INFO DAGScheduler: Job 8 failed: foreach at CifarApp.scala:105, took 0.049292 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 (TID 43, localhost): java.lang.ArrayIndexOutOfBoundsException

robertnishihara commented 8 years ago

Looks like the error is coming from this line:

 workers.foreach(_ => workerStore.get[CaffeSolver]("solver").trainNet.setWeights(broadcastWeights.value))

It's possible that the lookup workerStore.get[CaffeSolver] is failing. So perhaps try just

 workers.foreach(_ => workerStore.get[CaffeSolver]("solver"))

and see if that succeeds or fails.

If that is failing, it may be that some worker does not have a net on it. How many nodes are you using? And what are you passing into CifarApp for the number of workers?
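To make the suggested isolation step concrete: this is a minimal sketch (not SparkNet's actual `WorkerStore` implementation; the class and method bodies here are hypothetical) of a per-worker store keyed by name. If some worker never ran the setup step that put `"solver"` into its store, the later `get` throws rather than returning a solver, which is the failure mode being isolated by running `workers.foreach(_ => workerStore.get[CaffeSolver]("solver"))` on its own:

```scala
object WorkerStoreSketch {
  // Hypothetical stand-in for SparkNet's per-worker key/value store.
  class WorkerStore {
    private val store = scala.collection.mutable.Map.empty[String, Any]
    def put(key: String, value: Any): Unit = store(key) = value
    def get[T](key: String): T = store(key).asInstanceOf[T] // throws if key absent
  }

  def main(args: Array[String]): Unit = {
    // A worker that ran initialization: the lookup succeeds.
    val initialized = new WorkerStore
    initialized.put("solver", "a CaffeSolver would live here")
    println(initialized.get[String]("solver"))

    // A worker that skipped initialization: the same lookup throws
    // NoSuchElementException instead of returning a solver.
    val uninitialized = new WorkerStore
    try {
      uninitialized.get[String]("solver")
    } catch {
      case e: NoSuchElementException => println(s"lookup failed: ${e.getMessage}")
    }
  }
}
```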

hckuo2 commented 7 years ago

@robertnishihara I did that, but the following errors were raised:

F1009 04:25:49.868021  8028 split_layer.cpp:21] Check failed: count_ == top[i]->count() (100 vs. 1000000)
*** Check failure stack trace: ***
F1009 04:25:49.868021  8027 split_layer.cpp:21] Check failed: count_ == top[i]->count() (100 vs. 1000000)
F1009 04:25:49.868021  8029 blob.cpp:21] Check failed: count_ == other.count() (1000000 vs. 100)
*** Check failure stack trace: ***
Aborted
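These Caffe check failures are shape mismatches: a blob's count is the product of its shape dimensions, and the check at split_layer.cpp:21 requires each top blob to have the same element count as the layer's input (here 100 vs. 1000000), which typically means the net definition's blob dimensions don't match the data actually being fed. A sketch of that invariant, with hypothetical shapes chosen only to reproduce the counts in the log:

```scala
object BlobCountSketch {
  // A Caffe blob's count is the product of its shape dimensions.
  def count(shape: Seq[Int]): Int = shape.product

  def main(args: Array[String]): Unit = {
    // Hypothetical shapes reproducing the counts from the log above:
    val bottomCount = count(Seq(100))            // e.g. a batch of 100 labels
    val topCount    = count(Seq(100, 100, 100))  // some blob with 1,000,000 elements

    // The moral equivalent of Caffe's CHECK(count_ == top[i]->count()):
    if (bottomCount != topCount)
      println(s"Check failed: count_ == top[i]->count() ($bottomCount vs. $topCount)")
  }
}
```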