cerndb / dist-keras

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
http://joerihermans.com/work/distributed-keras/
GNU General Public License v3.0

Problem with running mnist.py example #48

Open menon92 opened 6 years ago

menon92 commented 6 years ago

I am trying to run mnist.py in standalone cluster mode. For this I changed master = "local[*]" to "spark://MY_MASTER_IP:7077" and submitted my job with the following command:

    spark-submit --master spark://192.168.1.101:7077 \
        --num-executors 1 \
        --executor-memory 8G \
        --executor-cores 2 \
        --driver-memory 8G PATH_TO/mnist.py

But I get the following error:

INFO DAGScheduler: ResultStage 1 (load at NativeMethodAccessorImpl.java:0) failed in 2.200 s due to Job aborted due to stage failure: Task 5 in stage 1.0 failed 4 times, most recent failure: Lost task 5.3 in stage 1.0 (TID 13, 192.168.1.102, executor 1): java.io.FileNotFoundException: File file:/home/semanticslab3/development/spark/spark-2.2.1-bin-hadoop2.7/bin/data/mnist_train.csv does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)
    at scala.collection.AbstractIterator.aggregate(Iterator.scala:1336)
    at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1113)
    at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1113)
    at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2125)
    at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2125)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

     Driver stacktrace:
    17/12/14 16:29:32 INFO DAGScheduler: Job 1 failed: load at NativeMethodAccessorImpl.java:0, took   2.211758 s

With master = "local[*]" it works fine, but then it uses only one worker. I want to use both of my workers.
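
For reference, the FileNotFoundException above means each executor resolves the CSV against its own local filesystem: with a plain file: path, mnist_train.csv has to exist at the same location on every machine that runs a task, which is why the job only fails once tasks are scheduled on the second node. A minimal sketch of reading the data from a location every worker can reach (the HDFS path and session setup are assumptions, not taken from mnist.py):

    from pyspark.sql import SparkSession

    # Build a session against the standalone master (address copied from the
    # submit command above).
    spark = SparkSession.builder \
        .master("spark://192.168.1.101:7077") \
        .appName("mnist") \
        .getOrCreate()

    # Read the training set from a path that every worker can reach. A plain
    # file: path only works if the CSV exists at the same absolute path on
    # every node; hdfs:///data/mnist_train.csv is an assumed example.
    raw_dataset = spark.read \
        .format("csv") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load("hdfs:///data/mnist_train.csv")

Alternatively, copying data/mnist_train.csv to the same absolute path on both workers avoids the need for a shared filesystem.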

Thanks

JoeriHermans commented 6 years ago

Hi,

Did you modify the example script in any way besides changing the master address? The script is not intended to be submitted through spark-submit.
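
The dist-keras examples normally construct their own Spark context from variables defined near the top of the script, so they are meant to be started directly with python rather than through spark-submit. A minimal sketch of that pattern (the variable names and settings are illustrative, not the exact contents of mnist.py):

    from pyspark import SparkConf, SparkContext

    # The script points itself at the standalone master, so spark-submit is
    # not needed; the context is created inside the script. All values below
    # are example settings.
    master = "spark://192.168.1.101:7077"
    application_name = "mnist"

    conf = SparkConf()
    conf.set("spark.app.name", application_name)
    conf.set("spark.master", master)
    conf.set("spark.executor.memory", "8g")
    conf.set("spark.executor.cores", "2")

    sc = SparkContext(conf=conf)

    # ... the rest of the example (reading the data, training the model) ...

With the context built inside the script, running python mnist.py on the driver machine is enough to use the whole cluster, provided the data is readable from every worker.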

Joeri

menon92 commented 6 years ago

Hello @JoeriHermans ,

Thanks for your quick reply. I just added an import statement, from pyspark.sql import SparkSession, to mnist.py and changed the master address; nothing more than that.

You said the script is not intended to be submitted through spark-submit, so why does the code run when I use local[*] as the master address?

If the script is not intended to be submitted through spark-submit, how can I run mnist.py so that it runs on both cluster nodes?

Or is it not possible to run it on a cluster at all?

Thanks