rvilla87 opened 5 years ago
I'd recommend loading the Hadoop JARs before Spark itself, like
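// load the Hadoop artifacts first, so the older hadoop-common that spark-sql pulls in transitively doesn't win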
import $ivy.`org.apache.hadoop:hadoop-common:2.9.2`
import $ivy.`org.apache.hadoop:hadoop-azure-datalake:3.1.1`
import $ivy.`org.apache.spark::spark-sql:2.4.0`
import org.apache.spark.sql.NotebookSparkSession
val spark = NotebookSparkSession.builder().appName("Spark").master("local[*]").getOrCreate()
import org.apache.hadoop.fs.adl.AdlFileSystem
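// point the adl:// scheme at the ADL filesystem implementation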
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
spark.read.parquet("adl://testing.azuredatalakestore.net/testing")
to prevent a lower version from being loaded via Spark.
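If in doubt, a quick sanity check (assuming the imports above) shows which hadoop-common actually ended up on the classpath:

import org.apache.hadoop.util.VersionInfo
// with the Hadoop JARs loaded first, this should print 2.9.2, not the older version spark-sql pulls in
println(VersionInfo.getVersion)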
Also, you should use NotebookSparkSession instead of SparkSession. The former passes some details from the notebook to Spark (in particular, the results of compiling the input code).
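For example (a minimal sketch, assuming the spark value from above), a function defined in a notebook cell can be used inside a job, because the classes compiled from your input are shipped to the executors; this typically breaks with a plain SparkSession once you run against a real cluster instead of local[*]:

// double is compiled from notebook input; NotebookSparkSession makes its class available to the executors
def double(x: Int): Int = x * 2
spark.sparkContext.parallelize(1 to 4).map(double).collect() // Array(2, 4, 6, 8)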
Yeah, loading the Hadoop JARs before Spark itself worked! 👍
If you want, I can close the issue, but I have some minor questions regarding your solution:
1. About the load order: the error was about the reloadExistingConfigurations method, and spark-sql doesn't even belong to the hadoop package. I mean, how can I know which libraries to load before others? Is it related to Ivy? As I said, with IntelliJ IDEA using SBT it works OK, and there I load spark-sql before hadoop.

2. About NotebookSparkSession: I have to load another JAR, right?

import $ivy.`sh.almond:almond-spark_2.11:0.3.1`
import org.apache.spark.sql.NotebookSparkSession
val spark = NotebookSparkSession.builder().appName("Spark").master("local").getOrCreate()
And when executing the Spark statement it returns this error:
cmd3.sc:1: Symbol 'term ammonite.interp' is missing from the classpath.
This symbol is required by 'value org.apache.spark.sql.NotebookSparkSession.interpApi'.
Make sure that term interp is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
A full rebuild may help if 'NotebookSparkSession.class' was compiled against an incompatible version of ammonite.
val spark = NotebookSparkSession.builder().appName("Spark").master("local").getOrCreate()
Anyway, almond-spark uses Spark 2.0.1 or some version lower than 2.4, right? Sorry, I don't recall where I saw it; I read it just a few hours ago on my PC (now I'm writing from my mobile).
Do you have any documentation where I can read about the differences between using the notebook session and the default one? I've used the default one for some years with no problems, and I find what you say interesting.
Thank you very much!
Hello,
I want to read some files from the Azure Data Lake (ADL) filesystem through Jupyter using the almond kernel (Scala), and I'm having problems with some dependencies.
It's important to note that when doing the same thing in IntelliJ IDEA I have no problems at all (but it lacks a good notebook system).
This is the code I use in Jupyter with almond:
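import $ivy.`org.apache.spark::spark-sql:2.4.0`
import $ivy.`org.apache.hadoop:hadoop-common:2.9.2`
import $ivy.`org.apache.hadoop:hadoop-azure-datalake:3.1.1`

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark").master("local[*]").getOrCreate()

import org.apache.hadoop.fs.adl.AdlFileSystem
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
spark.read.parquet("adl://testing.azuredatalakestore.net/testing")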
When executing this code I get this error:
java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()V
We need spark-sql in order to start the SparkSession, then hadoop-common to set the filesystem for Azure Data Lake (which uses the org.apache.hadoop.conf.Configuration class to reload the config), and finally hadoop-azure-datalake to use the AdlFileSystem class.

Regarding the error, I don't really know why I am getting it. The org.apache.hadoop.conf.Configuration class is only in the hadoop-common package, and it has the reloadExistingConfigurations() method. I have been searching on Google and found this issue, but it's related to an old Hadoop version, 2.7.3, which doesn't have this method.
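In case it helps, a quick reflection check (a sketch one can run in the same notebook session) would show whether the loaded Configuration class has the method, and which JAR it actually came from:

// throws NoSuchMethodException if an older hadoop-common (without the method) won the classpath
classOf[org.apache.hadoop.conf.Configuration].getMethod("reloadExistingConfigurations")
// prints the location of the JAR the Configuration class was loaded from
println(classOf[org.apache.hadoop.conf.Configuration].getProtectionDomain.getCodeSource.getLocation)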
Thanks in advance!