almond-sh / almond

A Scala kernel for Jupyter
https://almond.sh
BSD 3-Clause "New" or "Revised" License

Cluster Spark with EMR 6.1 / Hadoop 3 #705

Open bluesheeptoken opened 3 years ago

bluesheeptoken commented 3 years ago

Hello,

First of all, thanks for maintaining this kernel; I appreciate it.

I would like to use Hadoop 3 on our clusters. Is there a way to launch a Spark session with Hadoop 3 dependencies?

I have recompiled Spark with the Hadoop 3 profile and added hadoop-aws as a dependency, but the Hadoop jars passed to Spark are still version 2 (2.7.4, apart from hadoop-aws itself).

Is there a way to force the Hadoop version in almond 0.10.6?

For reference:

spark.sparkContext.getConf.get("spark.jars").split(',').filter(_.contains("org/apache/hadoop/"))
 /*"file:/home/hadoop/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.1/hadoop-aws-3.2.1.jar",
  "file:/home/hadoop/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/hadoop-client/2.7.4/hadoop-client-2.7.4.jar",
  "file:/home/hadoop/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/2.7.4/hadoop-common-2.7.4.jar",
  "file:/home/hadoop/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs/2.7.4/hadoop-hdfs-2.7.4.jar",
  "file:/home/hadoop/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-app/2.7.4/hadoop-mapreduce-client-app-2.7.4.jar",
  "file:/home/hadoop/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/hadoop-yarn-api/2.7.4/hadoop-yarn-api-2.7.4.jar",
  "file:/home/hadoop/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-core/2.7.4/hadoop-mapreduce-client-core-2.7.4.jar",
  "file:/home/hadoop/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-jobclient/2.7.4/hadoop-mapreduce-client-jobclient-2.7.4.jar",
  "file:/home/hadoop/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/hadoop-annotations/2.7.4/hadoop-annotations-2.7.4.jar",
  "file:/home/hadoop/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/hadoop-auth/2.7.4/hadoop-auth-2.7.4.jar",
  "file:/home/hadoop/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-common/2.7.4/hadoop-mapreduce-client-common-2.7.4.jar",
  "file:/home/hadoop/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.7.4/hadoop-mapreduce-client-shuffle-2.7.4.jar",
  "file:/home/hadoop/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/hadoop-yarn-common/2.7.4/hadoop-yarn-common-2.7.4.jar",
  "file:/home/hadoop/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/hadoop-yarn-client/2.7.4/hadoop-yarn-client-2.7.4.jar",
  "file:/home/hadoop/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/hadoop-yarn-server-common/2.7.4/hadoop-yarn-server-common-2.7.4.jar",
  "file:/home/hadoop/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/hadoop/hadoop-yarn-server-nodemanager/2.7.4/hadoop-yarn-server-nodemanager-2.7.4.jar"*/
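
To make the mix of versions easier to see, the listing above can be grouped by version. This is only a minimal sketch: `hadoopVersions` is a hypothetical helper (not part of almond or Spark), and it assumes the Maven-style path layout visible in the listing, `.../<artifact>/<version>/<artifact>-<version>.jar`.

```scala
// Hypothetical helper: group Hadoop jar URLs by the version directory in their path.
// Assumes each path ends with <artifact>/<version>/<file>.jar, as in the listing above.
def hadoopVersions(jars: Seq[String]): Map[String, Seq[String]] =
  jars
    .filter(_.contains("org/apache/hadoop/"))
    .flatMap { jar =>
      // The last three path segments are: artifact, version, file name.
      jar.split('/').toList.takeRight(3) match {
        case artifact :: version :: _ :: Nil => Some(version -> artifact)
        case _                               => None
      }
    }
    .groupBy(_._1)
    .map { case (version, pairs) => version -> pairs.map(_._2).sorted }

// In a notebook with a running session, it could be called as:
// hadoopVersions(spark.sparkContext.getConf.get("spark.jars").split(',').toSeq)
```

Any result with more than one key means that several Hadoop versions ended up in spark.jars.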

lanking520 commented 3 years ago

I think EMR 6.1 is going with Hadoop 2.7 (which is the default for Spark 3.0).

bluesheeptoken commented 3 years ago

EMR 6.x currently ships with Hadoop 3.2.1; see https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html