Did you check this part of the Spark docs? Which version of Hadoop are you using?
Anyway, I do not think we should explicitly add any AWS committer dependencies. As far as I know, different Spark distributions (Databricks, Cloudera, EMR, etc.) use different implementations of the S3 committer. Adding any AWS dependency to GraphAr explicitly may lead to dependency hell for some users.
I know of at least the s3a, s3n (separate dependency), and s3 (separate dependency) committers for integrating Spark with S3.
Could you give a code example?
Hi Sem, I tried to print the Hadoop version and got 3.3.1.
You can check out the s3-test code and run it with mvn test -Dsuites="org.apache.graphar.GraphInfoSuite load graph info s3"
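For context, the body of that test roughly amounts to the following. This is a minimal sketch: the loadGraphInfo(path, spark) signature is assumed from the stack trace quoted later in this thread, and the S3 path is the one discussed below.

    import org.apache.spark.sql.SparkSession
    import org.apache.graphar.GraphInfo

    // Minimal sketch of loading graph info from S3, assuming a
    // GraphInfo.loadGraphInfo(path, spark) signature.
    val spark = SparkSession.builder()
      .appName("graphar-s3-test")
      .master("local[*]")
      .getOrCreate()

    // The YAML path used in the test; switching the scheme to s3a://
    // is what is discussed next.
    val graphInfo = GraphInfo.loadGraphInfo("s3://graphar-data/ldbc.graph.yml", spark)
    println(graphInfo)

    spark.stop()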
Could you try s3a instead of s3?
Do you mean changing the path from s3://graphar-data/ldbc.graph.yml to s3a://graphar-data/ldbc.graph.yml?
I just tried it and got an error that Class org.apache.hadoop.fs.s3a.S3AFileSystem was not found:
- load graph info s3 *** FAILED ***
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2667)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.graphar.GraphInfo$.loadGraphInfo(GraphInfo.scala:379)
at org.apache.graphar.GraphInfoSuite.$anonfun$new$2(TestGraphInfo.scala:52)
My environment is our dev image apache/graphar-dev:latest: Ubuntu 22.04 and OpenJDK 11.0.22.
I think the reason is that GraphAr Spark relies on a custom build of Hadoop (com.aliyun). My current guess is that the S3A committer implementation is not included in the "hadoop-oss" build we are using. Let me check it.
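For reference, one quick way to check that from code, using only Hadoop calls that already appear in the stack trace above (a diagnostic sketch, not part of GraphAr):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem
    import org.apache.hadoop.util.VersionInfo

    // Which Hadoop build is actually on the classpath?
    println(s"Hadoop version: ${VersionInfo.getVersion}")

    // Ask Hadoop which FileSystem class backs the "s3a" scheme. If hadoop-aws
    // is missing, this fails with the same "S3AFileSystem not found" error as
    // in the stack trace above.
    val s3aClass = FileSystem.getFileSystemClass("s3a", new Configuration())
    println(s"s3a is backed by: ${s3aClass.getName}")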
OK, there is actually no s3a implementation. To bypass it, it should be enough to add the hadoop-aws JAR (https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) to the "${HOME}/${{ matrix.spark-hadoop }}/jars" folder (${SPARK_HOME}/jars). I checked, and the JAR is actually there.
For development we can add it to the jars folder permanently, but I do not like the idea of adding it to the distribution because it may lead to dependency hell...
The Apache Spark documentation directly mentions this case:
"Commercial products based on Apache Spark generally directly set up the classpath for talking to cloud infrastructures, in which case this module may not be needed."
I think with hadoop-aws on the classpath it should also work with s3n. Could you try it? s3n is the next generation of S3 committers for Spark.
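For illustration, a sketch of how a downstream user could add hadoop-aws themselves instead of GraphAr bundling it. The 3.3.1 version below is an assumption chosen to match the Hadoop version reported earlier; it should track whatever Hadoop the Spark build ships with.

    // Option 1: declare it in the application's own build (build.sbt shown;
    // Maven users would add the equivalent org.apache.hadoop:hadoop-aws dependency).
    libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.3.1"

    // Option 2: let spark-submit resolve it at launch time:
    //   spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 ...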
So I think we can add a guide to the documentation to tell users that if they want to use S3, they need to add hadoop-aws themselves?
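For example, such a guide could pair the hadoop-aws note with the standard S3A settings. A minimal sketch, assuming plain access-key authentication; all keys below are stock Hadoop S3A options with placeholder values, nothing GraphAr-specific:

    import org.apache.spark.sql.SparkSession

    // spark.hadoop.* settings are forwarded to the Hadoop configuration.
    val spark = SparkSession.builder()
      .appName("graphar-s3a")
      .master("local[*]")
      .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
      .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
      .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
      .getOrCreate()

    // With hadoop-aws on the classpath, s3a:// paths can then be read, e.g.:
    // GraphInfo.loadGraphInfo("s3a://graphar-data/ldbc.graph.yml", spark)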
Yes, I can try s3n.
What do you think about referencing the documentation of Apache Spark itself? Otherwise it may be confusing. Most Spark distributions (like Databricks Runtime, Microsoft Fabric, EMR, Cloudera Spark, etc.) already contain all the dependencies needed for integration with the corresponding cloud provider. These dependencies are often proprietary, and adding an OSS hadoop-aws may lead to unpredictable behaviour.
This section of the Hadoop documentation is very detailed.
Makes sense to me. I can open a PR and add the documentation as a reference.
@acezen I think we may also update the dev container by adding all the Spark dependencies needed for development, including hadoop-aws, to simplify development for people not familiar with the Hadoop ecosystem.
Describe the enhancement requested
Currently, the Scala SDK does not support S3.
Add related dependencies to enable S3 reading and writing.
Component(s)
Spark