almond-sh / almond

A Scala kernel for Jupyter
https://almond.sh
BSD 3-Clause "New" or "Revised" License

Is it possible to use provided spark? #197

Open saint1991 opened 6 years ago

saint1991 commented 6 years ago

Is there any way to use a provided Spark installation instead of downloading it in the notebook? In my case, I install Jupyter on Dataproc, where the Spark package is already provided.

It seems this would be possible if SPARK_HOME could be configured.
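
For illustration, a minimal sketch (standard library only, nothing almond-specific) of what's available when SPARK_HOME is set: the driver can read the variable easily, but the kernel would still need to wire the distribution's jars and conf into the Spark session.

```scala
// Sketch: detect a provided Spark distribution from the notebook.
// The paths below are the conventional layout of a Spark distribution,
// not anything almond currently consumes.
sys.env.get("SPARK_HOME") match {
  case Some(home) =>
    println(s"Spark distribution at $home")
    println(s"jars: $home/jars, conf: $home/conf/spark-defaults.conf")
  case None =>
    println("SPARK_HOME not set; the kernel would have to download Spark")
}
```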

aishfenton commented 6 years ago

I'd say this is the most common deployment type for businesses (i.e. Spark being provided by the container).

alexarchambault commented 6 years ago

@aishfenton I agree… Yet this poses a number of challenges.

When running Spark computations from the kernel, the kernel acts as the driver. Its classpath is that of almond, plus the user-added dependencies. If one relies on a Spark distribution, the classpath of the executors corresponds to the JARs in that distribution (plus those passed via spark.jars, I think).

That means the classpaths of the driver (almond) and the executors (the Spark distribution) don't necessarily match.
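
As a rough way to see such a mismatch (a sketch assuming a working SparkSession; this is plain Spark, not almond API), one can compare the driver's classpath with what the executors actually report:

```scala
import java.lang.management.ManagementFactory

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Classpath of the driver JVM (here, the kernel process).
val driverClasspath: Set[String] =
  ManagementFactory.getRuntimeMXBean.getClassPath
    .split(java.io.File.pathSeparator).toSet

// Classpath as reported from inside the executor JVMs.
val executorClasspaths: Array[Set[String]] =
  spark.sparkContext
    .parallelize(1 to spark.sparkContext.defaultParallelism)
    .map(_ =>
      ManagementFactory.getRuntimeMXBean.getClassPath
        .split(java.io.File.pathSeparator).toSet)
    .distinct()
    .collect()

// Entries on one side but not the other are exactly the kind of skew
// that leads to serialization/deserialization problems.
executorClasspaths.foreach { cp =>
  println(s"executor-only: ${(cp -- driverClasspath).take(5)}")
  println(s"driver-only:   ${(driverClasspath -- cp).take(5)}")
}
```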

I ran into numerous issues even with (very) minor differences between the driver and executor classpaths, like two versions of the scala-library JAR landing on the executor classpath (something like 2.11.2 and 2.11.7, IIRC), which made List deserialization fail.
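
A hypothetical check for that particular kind of skew (same assumptions as above: a working SparkSession, plain Spark API) would be to list every scala-library JAR visible to the executors:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Collect every scala-library JAR on the executors' classpaths.
val scalaLibJars: Array[String] =
  spark.sparkContext
    .parallelize(1 to spark.sparkContext.defaultParallelism)
    .flatMap { _ =>
      java.lang.management.ManagementFactory.getRuntimeMXBean.getClassPath
        .split(java.io.File.pathSeparator)
        .filter(_.contains("scala-library"))
    }
    .distinct()
    .collect()

// More than one entry here (e.g. 2.11.2 and 2.11.7) means trouble:
// objects serialized under one version may not deserialize under the other.
scalaLibJars.foreach(println)
```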

In the past, I circumvented that by using a vendored Spark version as a Maven dependency from almond (rather than via a Spark distribution), and only using the Spark configuration files from the distribution.
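
In a notebook, that workaround might look roughly like this (a sketch: the Spark version and the deliberately naive spark-defaults.conf parsing are illustrative assumptions, not almond's actual mechanism):

```scala
// Spark itself comes from Maven Central via the kernel's dependency loading;
// only the configuration is read from the distribution on the machine.
import $ivy.`org.apache.spark::spark-sql:2.4.8` // vendored Spark, not SPARK_HOME's jars

import scala.io.Source
import org.apache.spark.sql.SparkSession

val confFile = sys.env("SPARK_HOME") + "/conf/spark-defaults.conf"

// Naive parsing: whitespace-separated key/value pairs, '#' comments.
val defaults: Seq[(String, String)] =
  Source.fromFile(confFile).getLines()
    .map(_.trim)
    .filter(l => l.nonEmpty && !l.startsWith("#"))
    .map { l => val Array(k, v) = l.split("\\s+", 2); (k, v) }
    .toSeq

val spark = defaults
  .foldLeft(SparkSession.builder()) { case (b, (k, v)) => b.config(k, v) }
  .getOrCreate()
```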

Yet @dynofu seems to have successfully used a Spark distribution via ammonite-spark. I don't know how far he went, though…

dynofu commented 6 years ago

You can take a look at my scripts built on top of ammonite-spark: https://github.com/dyno/ammonite_with_spark_on_emr. spark.jars will pick up whatever is already on the EMR cluster, via ammonite-spark: https://github.com/dyno/ammonite_with_spark_on_emr/blob/master/emr.sc#L33.
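
The underlying idea, sketched here with plain SparkSession (this is not the actual emr.sc, just an illustration of pointing spark.jars at the distribution's JARs so driver and executors agree on a classpath):

```scala
import java.io.File

import org.apache.spark.sql.SparkSession

val sparkHome = sys.env("SPARK_HOME") // e.g. /usr/lib/spark on EMR

// Every JAR shipped with the cluster's Spark distribution.
val distributionJars: Seq[String] =
  new File(sparkHome + "/jars").listFiles()
    .filter(_.getName.endsWith(".jar"))
    .map(_.getAbsolutePath)
    .toSeq

val spark = SparkSession.builder()
  .config("spark.jars", distributionJars.mkString(","))
  .getOrCreate()
```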

mpacer commented 6 years ago

If one were to get a Spark distribution working via ammonite-spark, what more would be needed to surface the same functionality within an almond kernel?