almond-sh / almond

A Scala kernel for Jupyter
https://almond.sh
BSD 3-Clause "New" or "Revised" License

Problem working with hadoop-azure-datalake dependencies #332

Open rvilla87 opened 5 years ago

rvilla87 commented 5 years ago

Hello,

I want to read some files from the Azure Data Lake (ADL) filesystem through Jupyter using the almond (Scala) kernel, and I'm having problems with some dependencies.

Note that doing the same thing in IntelliJ IDEA I have no problems at all (but it lacks a good notebook system).

This is the code I use in Jupyter with almond:

import $ivy.`org.apache.spark::spark-sql:2.4.0`
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark").master("local[*]").getOrCreate()

import $ivy.`org.apache.hadoop:hadoop-common:2.9.2`
import $ivy.`org.apache.hadoop:hadoop-azure-datalake:3.1.1`

import org.apache.hadoop.fs.adl.AdlFileSystem

val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")

spark.read.parquet("adl://testing.azuredatalakestore.net/testing")

When executing this code I get this error: java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()V

We need spark-sql to start the SparkSession, then hadoop-common to set the filesystem for Azure Data Lake (which uses the org.apache.hadoop.conf.Configuration class to reload the config), and finally hadoop-azure-datalake to use the AdlFileSystem class.

Regarding the error, I don't really know why I am getting it. The org.apache.hadoop.conf.Configuration class is only in the hadoop-common package, and it does have the reloadExistingConfigurations() method.

I have been searching on Google and found this issue, but it relates to the older Hadoop 2.7.3, which doesn't have this method.
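As a side note, a quick check like the following (plain JVM reflection, nothing almond-specific) could show which JAR the loaded Configuration class actually comes from and whether that class has the method:

// Sketch: locate the JAR that provides the loaded Configuration class
val confClass = classOf[org.apache.hadoop.conf.Configuration]
println(confClass.getProtectionDomain.getCodeSource.getLocation)
// Check whether reloadExistingConfigurations() is present on that class
println(confClass.getMethods.exists(_.getName == "reloadExistingConfigurations"))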

Thanks in advance!

alexarchambault commented 5 years ago

I'd recommend loading the Hadoop JARs before Spark itself, like

import $ivy.`org.apache.hadoop:hadoop-common:2.9.2`
import $ivy.`org.apache.hadoop:hadoop-azure-datalake:3.1.1`

import $ivy.`org.apache.spark::spark-sql:2.4.0`
// NotebookSparkSession comes from the almond-spark module
// (load it with import $ivy.`sh.almond::almond-spark:<almond version>` if it isn't already on the classpath)
import org.apache.spark.sql.NotebookSparkSession
val spark = NotebookSparkSession.builder().appName("Spark").master("local[*]").getOrCreate()

import org.apache.hadoop.fs.adl.AdlFileSystem

val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")

spark.read.parquet("adl://testing.azuredatalakestore.net/testing")

to prevent a lower version from being loaded via Spark.
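A quick way to confirm which Hadoop actually ended up on the classpath after those imports (VersionInfo is part of hadoop-common):

// Prints the Hadoop version that was resolved on the classpath
println(org.apache.hadoop.util.VersionInfo.getVersion)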

Also, you should use NotebookSparkSession instead of SparkSession. The former passes some details from the notebook to Spark (in particular, the results of compiling the input code).

rvilla87 commented 5 years ago

Yeah, loading the Hadoop JARs before Spark itself worked! 👍

If you want I can close the issue, but I have some minor questions regarding your solution. To use NotebookSparkSession I did the following:

import $ivy.`sh.almond:almond-spark_2.11:0.3.1`
import org.apache.spark.sql.NotebookSparkSession
val spark = NotebookSparkSession.builder().appName("Spark").master("local").getOrCreate()

And when executing the Spark statement it returns this error:

cmd3.sc:1: Symbol 'term ammonite.interp' is missing from the classpath.
This symbol is required by 'value org.apache.spark.sql.NotebookSparkSession.interpApi'.
Make sure that term interp is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
A full rebuild may help if 'NotebookSparkSession.class' was compiled against an incompatible version of ammonite.
val spark = NotebookSparkSession.builder().appName("Spark").master("local").getOrCreate()

Anyway, almond-spark uses Spark 2.0.1, or at least a version lower than 2.4, right? Sorry, I don't recall where I saw it; I read it a few hours ago on my PC (now I'm writing on mobile).

Do you have some documentation where I can read about the difference between using the notebook session and the default one? I have been using the default one for some years with no problems, and I find what you say interesting.

Thank you very much!