saint1991 opened this issue 6 years ago
I'd say this is the most common deployment type (i.e. Spark being provided by the container) for businesses.
@aishfenton I agree… Yet this poses a number of challenges.
When running Spark calculations from the kernel, the kernel acts as the driver. Its classpath is that of almond, plus the user-added dependencies. If one relies on a Spark distribution, the classpath of the executors corresponds to the jars in the Spark distribution (plus those passed via spark.jars, I think).
That means the classpath on the driver (almond) and on the executors (the Spark distribution) don't necessarily match.
I ran into numerous issues even with (very) minor differences between the driver and executor classpaths (like two versions of the scala-library JAR landing on the executor classpath, something like 2.11.2 and 2.11.7 IIRC, making List deserialization fail).
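For illustration, here's a quick way to eyeball such mismatches from the kernel — a minimal sketch, assuming the distribution lives under SPARK_HOME (the /usr/lib/spark fallback is just a guess):

```scala
import java.io.File

// Jars visible to the driver JVM (the almond kernel plus user-added deps).
val driverJars: Set[String] = sys.props("java.class.path")
  .split(File.pathSeparator)
  .filter(_.endsWith(".jar"))
  .map(p => new File(p).getName)
  .toSet

// Jars shipped with the Spark distribution (what the executors typically see).
val sparkHome = sys.env.getOrElse("SPARK_HOME", "/usr/lib/spark")
val executorJars: Set[String] = Option(new File(s"$sparkHome/jars").listFiles())
  .getOrElse(Array.empty[File])
  .collect { case f if f.getName.endsWith(".jar") => f.getName }
  .toSet

// Jars present on only one side are candidates for exactly the kind of
// conflict described above (e.g. two different scala-library versions).
println("driver-only jars:   " + (driverJars -- executorJars).toList.sorted.mkString(", "))
println("executor-only jars: " + (executorJars -- driverJars).toList.sorted.mkString(", "))
```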
In the past, I circumvented that by pulling a vendored Spark version as a Maven dependency from almond (rather than via a Spark distribution), and only using the Spark configuration files from the distribution.
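Concretely, that workaround looks roughly like this in a notebook cell (the versions below are placeholders, not a recommendation):

```scala
import $ivy.`org.apache.spark::spark-sql:2.4.0`  // placeholder version
import $ivy.`sh.almond::almond-spark:0.6.0`      // placeholder version

import org.apache.spark.sql._

// NotebookSparkSession comes from almond-spark and hooks the session up to
// the kernel. Spark itself is the Maven artifact above, not a distribution,
// so the driver and executor classpaths stay consistent.
val spark = NotebookSparkSession.builder()
  .master("yarn")  // assuming a YARN cluster; the conf files can also set this
  .getOrCreate()
```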
Yet @dynofu seems to have successfully used a Spark distribution via ammonite-spark. I don't know how far he went, though…
You can take a look at my scripts built on top of ammonite-spark: https://github.com/dyno/ammonite_with_spark_on_emr. The spark.jars setting will use whatever is already on the EMR cluster, via ammonite-spark: https://github.com/dyno/ammonite_with_spark_on_emr/blob/master/emr.sc#L33.
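Roughly, the idea is the following (a paraphrase, not a copy of emr.sc — the paths and versions are assumptions):

```scala
import $ivy.`sh.almond::ammonite-spark:0.4.0`  // placeholder version

import java.io.File
import org.apache.spark.sql._

// EMR usually installs Spark under /usr/lib/spark; treat the path as an
// assumption rather than a guarantee.
val sparkHome = sys.env.getOrElse("SPARK_HOME", "/usr/lib/spark")
val providedJars = new File(s"$sparkHome/jars")
  .listFiles()
  .filter(_.getName.endsWith(".jar"))
  .map(_.getAbsolutePath)
  .mkString(",")

val spark = AmmoniteSparkSession.builder()
  .master("yarn")
  .config("spark.jars", providedJars)  // ship the cluster's own jars
  .getOrCreate()
```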
If one were to get a Spark distribution working via ammonite-spark, what more would be needed to surface the same functionality within an almond kernel?
Are there any ways to use a provided Spark installation instead of downloading it in a notebook? In my case, I install Jupyter on Dataproc, where the Spark package is already provided.
It seems to be possible if SPARK_HOME can be configured.
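For example, a minimal sketch of what honoring SPARK_HOME could look like — none of this is existing almond behavior, and the paths and parsing are assumptions on my part:

```scala
import scala.io.Source

// Dataproc typically puts Spark under /usr/lib/spark; an assumption, not a given.
val sparkHome = sys.env.getOrElse("SPARK_HOME", "/usr/lib/spark")

// Read the provided spark-defaults.conf into key/value pairs.
val defaults: Map[String, String] =
  Source.fromFile(s"$sparkHome/conf/spark-defaults.conf")
    .getLines()
    .map(_.trim)
    .filter(l => l.nonEmpty && !l.startsWith("#"))
    .flatMap { l =>
      l.split("\\s+", 2) match {
        case Array(k, v) => Some(k -> v.trim)
        case _           => None
      }
    }
    .toMap

// The pairs could then be fed to a session builder, e.g.
//   defaults.foldLeft(NotebookSparkSession.builder()) { case (b, (k, v)) => b.config(k, v) }
//     .getOrCreate()
```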