databricks / spark-deep-learning

Deep Learning Pipelines for Apache Spark
https://databricks.github.io/spark-deep-learning
Apache License 2.0

cannot use the spark-deep-learning library in cluster mode #111

Closed nadamaguid closed 6 years ago

nadamaguid commented 6 years ago

I've been trying to set up an EMR 5.7.0 cluster with Spark 2.1.1 and Python 2 in order to use the spark-deep-learning library to read images, which I then intend to feed into the dist-keras library. I can't get past the SparkSession creation when running in cluster mode. I've tried supplying the library via the pyspark --packages argument, as well as several Spark options including spark.jars.packages, spark.yarn.jars, and spark.yarn.dist.jars. I've also tried downloading the jar, manually adding it to Spark's jars folder, zipping that folder, uploading it to HDFS, and passing that path via spark.yarn.archive, but then I get an error that the HDFS path is invalid. With all the other options I get the following error:

2018-04-04 22:25:08 ERROR SparkContext:91 - Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:89)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
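For reference, the kind of invocation I've been using looks roughly like this (the package coordinate is the spark-packages one from the project README for the 0.2.0 / Spark 2.1 / Scala 2.11 build; the master and deploy mode are just what my setup uses, so treat this as a sketch rather than the exact failing command):

pyspark --master yarn --deploy-mode client \
  --packages databricks:spark-deep-learning:0.2.0-spark2.1-s_2.11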

Please, please help... I'm supposed to be working on a project, my deadline is approaching, and I haven't been able to crack this for weeks now.

Thanks in advance.

nadamaguid commented 6 years ago

UPDATE

So far I had been trying the options described above from Jupyter. I have now attempted to simply submit the Python file through spark-submit, using the following command:

bin/spark-submit --jars /home/hadoop/spark-deep-learning-0.2.0-spark2.1-s_2.11.jar --driver-class-path /home/hadoop/spark-deep-learning-0.2.0-spark2.1-s_2.11.jar /home/hadoop/RunningCode5.py

I keep getting the following import error, and the job exits:

Traceback (most recent call last):
  File "/home/hadoop/RunningCode5.py", line 74, in <module>
    from sparkdl import KerasImageFileTransformer
ImportError: No module named sparkdl

From previous runs I had concluded that this error only shows up when the jar file is inaccessible, or when it has been added more than once and is therefore picked up incorrectly. But, as I understand the explanation given here, the jar needs to be shipped to the workers and made clearly visible to the executors. I have done exactly that, and I haven't manually copied the jar anywhere else or added it to HDFS.
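For comparison, a variant I am still sketching out (this assumes the jar also bundles the sparkdl Python sources, so that passing it via --py-files would put the module on the driver's and executors' PYTHONPATH; the paths are just the ones from my setup):

bin/spark-submit \
  --jars /home/hadoop/spark-deep-learning-0.2.0-spark2.1-s_2.11.jar \
  --driver-class-path /home/hadoop/spark-deep-learning-0.2.0-spark2.1-s_2.11.jar \
  --py-files /home/hadoop/spark-deep-learning-0.2.0-spark2.1-s_2.11.jar \
  /home/hadoop/RunningCode5.py

Alternatively, letting spark-submit resolve the package itself (again assuming the spark-packages coordinate from the README):

bin/spark-submit --packages databricks:spark-deep-learning:0.2.0-spark2.1-s_2.11 /home/hadoop/RunningCode5.py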

What is going on??

Gauravshah commented 5 years ago

@nadamaguid were you able to move forward? I'm facing a similar issue.