BlueBrain / functionalizer

Create a functional connectome from physical connections between cells
https://functionalizer.readthedocs.io/en/stable/
Apache License 2.0

Reduce dependency on NVME storage / HDFS #8

Open matz-e opened 1 month ago

matz-e commented 1 month ago

Due to the badly degraded performance with many small files, Functionalizer spawns a Hadoop file system cluster and stores its checkpoint data there.

This inflates the size of the Functionalizer Docker container, since a full Hadoop installation is required, and it forces us to use larger SSD storage on the nodes. We should look into storing checkpoints somewhere else.

matz-e commented 2 days ago

I managed to create the following pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>ch.epfl.bbp.functionalizer</groupId>
  <artifactId>Functionalizer</artifactId>
  <version>1.0-SNAPSHOT</version>

  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>3.3.4</version>
    </dependency>
  </dependencies>
</project>

and resolved and copied the dependencies into target/dependency/ with Maven via:

mvn install dependency:copy-dependencies
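For reference, dependency:copy-dependencies places hadoop-aws and its transitive dependencies (including an AWS SDK bundle jar) under target/dependency/. A quick way to sanity-check that before starting Spark, as a small sketch:

from pathlib import Path

# List the jars Maven copied; hadoop-aws and an AWS SDK bundle jar should be
# among them before Spark is started with the classpath settings below.
dep_dir = Path("target") / "dependency"
for jar in sorted(dep_dir.glob("*.jar")):
    print(jar.name)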

Then, using the following Python test script (saved as sls.py):

import pyspark
import pyspark.sql
from pathlib import Path

# Local Spark setup with the Maven-copied Hadoop/AWS jars on the classpath.
conf = pyspark.conf.SparkConf()
conf.setMaster("local").setAppName("teschd")
jars = Path(".").resolve() / "target" / "dependency" / "*"
conf.set("spark.driver.extraClassPath", str(jars))
conf.set("spark.executor.extraClassPath", str(jars))
# conf.set("spark.hadoop.fs.s3a.endpoint", "https://bbpobjectstorage.epfl.ch")
# conf.set("spark.hadoop.fs.s3a.endpoint.region", "ch-gva-1")
# conf.set("spark.hadoop.fs.s3a.access.key", "")
# conf.set("spark.hadoop.fs.s3a.secret.key", "")
# conf.set("log4j.logger.software.amazon.awssdk.request", "DEBUG")

sc = pyspark.context.SparkContext(conf=conf)
sql = pyspark.sql.SQLContext(sc)

# Read a Parquet file directly from S3 through the s3a:// connector.
df = sql.read.parquet("s3a://hornbach-please-delete-me/touchesData.0.parquet")
df.show()
# sql.read.parquet("s3a://access-test/dumbo.parquet")

I was able to access the S3 bucket referenced in the script with

python sls.py

provided that the right AWS access keys were exported into the shell environment. The same approach did not work when attempting to access an S3 bucket on NetApp.
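For reference, a minimal sketch of passing the same credentials to Spark explicitly, through the fs.s3a.* settings that are commented out in the script above, reading them from the standard AWS environment variables instead of relying on the default credential lookup (classpath and endpoint settings omitted here):

import os

import pyspark

# Sketch: feed the standard AWS environment variables into the fs.s3a.*
# settings shown (commented out) in sls.py above.
conf = pyspark.conf.SparkConf().setMaster("local").setAppName("s3a-credentials-sketch")
conf.set("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
conf.set("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])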

This would allow us to store the checkpoints in a temporary S3 bucket rather than spawning a Hadoop cluster just for that purpose.
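As a rough sketch of that idea, assuming checkpoints stay plain Parquet writes and using a made-up bucket name and prefix layout, checkpoint data could go under a disposable, per-run prefix:

import uuid
from pathlib import Path

import pyspark
import pyspark.sql

# Sketch only: bucket name, prefix layout, and step name are hypothetical.
conf = pyspark.conf.SparkConf().setMaster("local").setAppName("s3a-checkpoint-sketch")
jars = Path(".").resolve() / "target" / "dependency" / "*"
conf.set("spark.driver.extraClassPath", str(jars))
conf.set("spark.executor.extraClassPath", str(jars))

sc = pyspark.context.SparkContext(conf=conf)
sql = pyspark.sql.SQLContext(sc)

# Disposable, per-run prefix for checkpoint data.
checkpoint_base = f"s3a://functionalizer-scratch/{uuid.uuid4()}"

df = sql.range(10)  # stand-in for an intermediate Functionalizer DataFrame

# Write the intermediate result as Parquet to S3 ...
df.write.parquet(f"{checkpoint_base}/step_0.parquet")
# ... and read it back later instead of recomputing it.
restored = sql.read.parquet(f"{checkpoint_base}/step_0.parquet")
restored.show()

The prefix (and bucket, if created per run) could then simply be deleted once the run finishes.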