almond-sh / almond

A Scala kernel for Jupyter
https://almond.sh
BSD 3-Clause "New" or "Revised" License

Problem working with hadoop-azure-datalake dependencies #332

Open rvilla87 opened 5 years ago

rvilla87 commented 5 years ago

Hello,

I want to read some files from the Azure Data Lake (ADL) filesystem through Jupyter using the almond (Scala) kernel, and I'm having problems with some dependencies.

Note that doing the same thing in IntelliJ IDEA I have no problems at all (but it lacks a good notebook system).

This is the code I use in Jupyter with almond:

import $ivy.`org.apache.spark::spark-sql:2.4.0`
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark").master("local[*]").getOrCreate()

import $ivy.`org.apache.hadoop:hadoop-common:2.9.2`
import $ivy.`org.apache.hadoop:hadoop-azure-datalake:3.1.1`

import org.apache.hadoop.fs.adl.AdlFileSystem

val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")

spark.read.parquet("adl://testing.azuredatalakestore.net/testing")

When executing this code I get this error: java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()V

We need spark-sql to start the SparkSession, then hadoop-common to set the filesystem for Azure Data Lake (which uses the org.apache.hadoop.conf.Configuration class to reload the config), and finally hadoop-azure-datalake to use the AdlFileSystem class.

Regarding the error, I don't really know why I am getting it. The org.apache.hadoop.conf.Configuration class is only in the hadoop-common package, and it does have the reloadExistingConfigurations() method.

I have been searching on Google and found this issue, but it relates to the older Hadoop 2.7.3, which doesn't have this method.
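As a side note, a quick check like the following (plain JVM reflection, nothing almond-specific) could show which JAR the loaded Configuration class actually comes from and whether that class has the method:

// Sketch: locate the JAR that provides the loaded Configuration class
val confClass = classOf[org.apache.hadoop.conf.Configuration]
println(confClass.getProtectionDomain.getCodeSource.getLocation)
// Check whether reloadExistingConfigurations() is present on that class
println(confClass.getMethods.exists(_.getName == "reloadExistingConfigurations"))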

Thanks in advance!

alexarchambault commented 5 years ago

I'd recommend loading the Hadoop JARs before Spark itself, like

import $ivy.`org.apache.hadoop:hadoop-common:2.9.2`
import $ivy.`org.apache.hadoop:hadoop-azure-datalake:3.1.1`

import $ivy.`org.apache.spark::spark-sql:2.4.0`
// NotebookSparkSession comes from the almond-spark module
// (load it with import $ivy.`sh.almond::almond-spark:<almond version>` if it isn't already on the classpath)
import org.apache.spark.sql.NotebookSparkSession
val spark = NotebookSparkSession.builder().appName("Spark").master("local[*]").getOrCreate()

import org.apache.hadoop.fs.adl.AdlFileSystem

val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")

spark.read.parquet("adl://testing.azuredatalakestore.net/testing")

to prevent a lower version from being loaded via Spark.
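A quick way to confirm which Hadoop actually ended up on the classpath after those imports (VersionInfo is part of hadoop-common):

// Prints the Hadoop version that was resolved on the classpath
println(org.apache.hadoop.util.VersionInfo.getVersion)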

Also, you should use NotebookSparkSession instead of SparkSession. The former passes some details from the notebook to Spark (in particular, the results of compiling the input code).

rvilla87 commented 5 years ago

Yeah, loading the Hadoop JARs before Spark itself worked! 👍

If you want I can close the issue, but I have some minor questions regarding your solution. To use NotebookSparkSession I did the following:

import $ivy.`sh.almond:almond-spark_2.11:0.3.1`
import org.apache.spark.sql.NotebookSparkSession
val spark = NotebookSparkSession.builder().appName("Spark").master("local").getOrCreate()

And when executing the Spark statement it returns this error:

cmd3.sc:1: Symbol 'term ammonite.interp' is missing from the classpath.
This symbol is required by 'value org.apache.spark.sql.NotebookSparkSession.interpApi'.
Make sure that term interp is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
A full rebuild may help if 'NotebookSparkSession.class' was compiled against an incompatible version of ammonite.
val spark = NotebookSparkSession.builder().appName("Spark").master("local").getOrCreate()

Anyway, almond-spark uses Spark 2.0.1, or at least a version lower than 2.4, right? Sorry, I don't recall where I saw it; I read it a few hours ago on my PC (now I'm writing on mobile).

Do you have some documentation where I can read about the difference between using the notebook session and the default one? I have been using the default one for some years with no problems, and I find what you say interesting.

Thank you very much!