apache / incubator-graphar

An open source, standard data file format for graph data storage and retrieval.
https://graphar.apache.org/
Apache License 2.0

feat(scala): Support S3 reading/writing #570

Closed: acezen closed this issue 3 months ago

acezen commented 3 months ago

Describe the enhancement requested

Currently, the Scala SDK does not support reading from or writing to S3:

org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
  at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
  at org.apache.graphar.GraphInfo$.loadGraphInfo(GraphInfo.scala:379)
  at org.apache.graphar.GraphInfoSuite.$anonfun$new$2(TestGraphInfo.scala:49)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)

Add related dependencies to enable S3 reading and writing.
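
For context, a minimal sketch of the failing call path (the two-argument `loadGraphInfo(path, spark)` signature is assumed from the test suite; the bucket path is the one used later in this thread):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.graphar.GraphInfo

// Minimal sketch of the failing call; the loader signature is assumed
// from GraphInfoSuite and the bucket path is the one used in the tests.
val spark = SparkSession.builder()
  .appName("graphar-s3-repro")
  .master("local[*]")
  .getOrCreate()

// Throws UnsupportedFileSystemException: No FileSystem for scheme "s3",
// because no S3 FileSystem implementation is bound to the "s3" scheme.
val graphInfo = GraphInfo.loadGraphInfo("s3://graphar-data/ldbc.graph.yml", spark)
```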

Component(s)

Spark

SemyonSinchenko commented 3 months ago

Did you check this part of the Spark docs? Which version of Hadoop are you using?

SemyonSinchenko commented 3 months ago

Anyway, I do not think we should explicitly add any AWS committer dependencies. As far as I know, different Spark distributions (Databricks, Cloudera, EMR, etc.) use different implementations of the S3 committer. Explicitly adding any AWS dependency to GraphAr may lead to dependency hell for some users.

I know of at least the s3a, s3n (separate dependency), and s3 (separate dependency) committers for integrating Spark with S3.
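
For context on how these schemes resolve: Hadoop maps a URI scheme to a FileSystem class through the fs.<scheme>.impl setting (or a ServiceLoader entry), which is what the "No FileSystem for scheme \"s3\"" error above is about. A minimal sketch using standard Hadoop configuration keys, assuming an implementation such as hadoop-aws is on the classpath:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: bind the plain "s3" scheme to the S3A implementation shipped
// in hadoop-aws. Without such a binding (or a ServiceLoader entry), Hadoop
// throws UnsupportedFileSystemException: No FileSystem for scheme "s3".
val conf = new Configuration()
conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

// This still fails with ClassNotFoundException if hadoop-aws is missing.
val path = new Path("s3://graphar-data/ldbc.graph.yml")
val fs = FileSystem.get(path.toUri, conf)
```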

SemyonSinchenko commented 3 months ago

Could you give a code example?

acezen commented 3 months ago

> Could you give a code example?

Hi Sem, I tried printing the Hadoop version and got 3.3.1.

You can check out the s3-test code and run it with:

mvn test -Dsuites="org.apache.graphar.GraphInfoSuite load graph info s3"

SemyonSinchenko commented 3 months ago

Could you try s3a instead of s3?

acezen commented 3 months ago

> Could you try s3a instead of s3?

You mean changing the path from s3://graphar-data/ldbc.graph.yml to s3a://graphar-data/ldbc.graph.yml?

acezen commented 3 months ago

> Could you try s3a instead of s3?

> You mean changing the path from s3://graphar-data/ldbc.graph.yml to s3a://graphar-data/ldbc.graph.yml?

I just tried it and got an error that the class org.apache.hadoop.fs.s3a.S3AFileSystem was not found:

- load graph info s3 *** FAILED ***
  java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2667)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
  at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
  at org.apache.graphar.GraphInfo$.loadGraphInfo(GraphInfo.scala:379)
  at org.apache.graphar.GraphInfoSuite.$anonfun$new$2(TestGraphInfo.scala:52)

My environment is our dev image apache/graphar-dev:latest: Ubuntu 22.04 and OpenJDK 11.0.22.

SemyonSinchenko commented 3 months ago

I think the reason is that GraphAr Spark relies on a custom Hadoop build (com.aliyun). My current guess is that an implementation of the S3A committer is not included in the "hadoop-oss" build we are using. Let me check.

SemyonSinchenko commented 3 months ago

Ok, there is actually no s3a implementation. To work around it, it should be enough to add the hadoop-aws JAR (https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) to the "${HOME}/${{ matrix.spark-hadoop }}/jars" folder (${SPARK_HOME}/jars). I checked; the JAR is actually there.

For development we can add it permanently to the jars folder, but I do not like the idea of adding it to the distribution because it may lead to dependency hell...

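For local development, an alternative to copying the JAR into ${SPARK_HOME}/jars is to let Spark resolve it at startup. A sketch, not part of the GraphAr API: spark.jars.packages only takes effect before the first SparkContext is created (otherwise use spark-submit --packages), and the hadoop-aws version should match the Hadoop build Spark ships with (3.3.1 in this thread):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: resolve hadoop-aws from Maven at startup instead of copying the JAR
// into ${SPARK_HOME}/jars. Only takes effect if set before the first
// SparkContext is created (or passed via spark-submit --packages).
val spark = SparkSession.builder()
  .appName("graphar-s3")
  .master("local[*]")
  .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.1")
  .getOrCreate()
```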

SemyonSinchenko commented 3 months ago

The Apache Spark documentation directly mentions this case:

> Commercial products based on Apache Spark generally directly set up the classpath for talking to cloud infrastructures, in which case this module may not be needed.

SemyonSinchenko commented 3 months ago

I think that with hadoop-aws on the classpath it should also work with s3n. Could you try it? s3n is the next generation of S3 committers for Spark.

acezen commented 3 months ago

> Ok, there is actually no s3a implementation. To work around it, it should be enough to add the hadoop-aws JAR (https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) to the "${HOME}/${{ matrix.spark-hadoop }}/jars" folder (${SPARK_HOME}/jars). I checked; the JAR is actually there.
>
> For development we can add it permanently to the jars folder, but I do not like the idea of adding it to the distribution because it may lead to dependency hell...

So I think we can add a guide to the documentation telling users that if they want to use S3, they need to add hadoop-aws themselves?
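
Such a guide could come down to a single dependency the user adds on their side, for example in sbt (a hypothetical snippet; the version has to match the Hadoop build the user's Spark was compiled against, 3.3.1 here):

```scala
// Hypothetical build.sbt line: users bring hadoop-aws themselves,
// matching their Spark's bundled Hadoop version.
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.3.1" % "provided"
```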

acezen commented 3 months ago

> I think that with hadoop-aws on the classpath it should also work with s3n. Could you try it? s3n is the next generation of S3 committers for Spark.

Yes, I can try it.

SemyonSinchenko commented 3 months ago

> So I think we can add a guide to the documentation telling users that if they want to use S3, they need to add hadoop-aws themselves?

What do you think about referencing the documentation of Apache Spark itself? Otherwise it may be confusing: most Spark distributions (like Databricks Runtime, Microsoft Fabric, EMR, Cloudera Spark, etc.) already contain all the dependencies needed for integration with the corresponding cloud provider. These dependencies are often proprietary, and adding the OSS hadoop-aws may lead to unpredictable behaviour.

This section of the Hadoop documentation is very detailed.
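
For users on a plain OSS Spark build who do add hadoop-aws themselves, the remaining wiring is standard Hadoop S3A configuration rather than anything GraphAr-specific. A sketch with well-known S3A keys; the endpoint and credential values are placeholders, and on managed platforms this is usually preconfigured and should not be overridden:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: standard S3A settings passed through Spark's "spark.hadoop." prefix.
// Values are placeholders; managed runtimes (Databricks, EMR, Fabric, ...)
// typically configure this already.
val spark = SparkSession.builder()
  .appName("graphar-s3a")
  .master("local[*]")
  .config("spark.hadoop.fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
  .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
  .getOrCreate()

// With hadoop-aws on the classpath, s3a:// paths resolve to S3AFileSystem.
val df = spark.read.text("s3a://graphar-data/ldbc.graph.yml")
```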

acezen commented 3 months ago

> > So I think we can add a guide to the documentation telling users that if they want to use S3, they need to add hadoop-aws themselves?
>
> What do you think about referencing the documentation of Apache Spark itself? Otherwise it may be confusing: most Spark distributions (like Databricks Runtime, Microsoft Fabric, EMR, Cloudera Spark, etc.) already contain all the dependencies needed for integration with the corresponding cloud provider. These dependencies are often proprietary, and adding the OSS hadoop-aws may lead to unpredictable behaviour.
>
> This section of the Hadoop documentation is very detailed.

Makes sense to me. I can open a PR and add the documentation as a reference.

SemyonSinchenko commented 3 months ago

@acezen I think we may also update the dev container by adding all the Spark dependencies needed for development, including hadoop-aws, to simplify development for people not familiar with the Hadoop ecosystem.