BlueBrain / functionalizer

Create a functional connectome from physical connections between cells
https://functionalizer.readthedocs.io/en/stable/
Apache License 2.0

Reduce dependency on NVME storage / HDFS #8

Open matz-e opened 1 month ago

matz-e commented 1 month ago

Due to the badly degraded performance with many small files, Functionalizer spawns a Hadoop file system cluster and stores its checkpoint data there.

This inflates the size of the Functionalizer Docker container, since a full Hadoop installation is required, and it forces us to use larger SSD storage on the nodes. We should look into storing checkpoints somewhere else.

matz-e commented 2 days ago

I managed to create the following pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>ch.epfl.bbp.functionalizer</groupId>
  <artifactId>Functionalizer</artifactId>
  <version>1.0-SNAPSHOT</version>

  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>3.3.4</version>
    </dependency>
  </dependencies>
</project>

and resolved and copied the dependencies into target/dependency/ with Maven via:

mvn install dependency:copy-dependencies
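For reference, dependency:copy-dependencies places hadoop-aws and its transitive dependencies (including an AWS SDK bundle jar) under target/dependency/. A quick way to sanity-check that before starting Spark, as a small sketch:

from pathlib import Path

# List the jars Maven copied; hadoop-aws and an AWS SDK bundle jar should be
# among them before Spark is started with the classpath settings below.
dep_dir = Path("target") / "dependency"
for jar in sorted(dep_dir.glob("*.jar")):
    print(jar.name)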

Then, using the following Python test script (saved as sls.py):

import pyspark
import pyspark.sql
from pathlib import Path

# Local Spark setup with the Maven-copied Hadoop/AWS jars on the classpath.
conf = pyspark.conf.SparkConf()
conf.setMaster("local").setAppName("teschd")
jars = Path(".").resolve() / "target" / "dependency" / "*"
conf.set("spark.driver.extraClassPath", str(jars))
conf.set("spark.executor.extraClassPath", str(jars))
# conf.set("spark.hadoop.fs.s3a.endpoint", "https://bbpobjectstorage.epfl.ch")
# conf.set("spark.hadoop.fs.s3a.endpoint.region", "ch-gva-1")
# conf.set("spark.hadoop.fs.s3a.access.key", "")
# conf.set("spark.hadoop.fs.s3a.secret.key", "")
# conf.set("log4j.logger.software.amazon.awssdk.request", "DEBUG")

sc = pyspark.context.SparkContext(conf=conf)
sql = pyspark.sql.SQLContext(sc)

# Read a Parquet file directly from S3 through the s3a:// connector.
df = sql.read.parquet("s3a://hornbach-please-delete-me/touchesData.0.parquet")
df.show()
# sql.read.parquet("s3a://access-test/dumbo.parquet")

I was able to access the S3 bucket referenced in the script with

python sls.py

provided that the right AWS access keys were exported into the shell environment. The same approach did not work when attempting to access an S3 bucket on NetApp.
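For reference, a minimal sketch of passing the same credentials to Spark explicitly, through the fs.s3a.* settings that are commented out in the script above, reading them from the standard AWS environment variables instead of relying on the default credential lookup (classpath and endpoint settings omitted here):

import os

import pyspark

# Sketch: feed the standard AWS environment variables into the fs.s3a.*
# settings shown (commented out) in sls.py above.
conf = pyspark.conf.SparkConf().setMaster("local").setAppName("s3a-credentials-sketch")
conf.set("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
conf.set("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])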

This would allow us to store the checkpoints in a temporary S3 bucket rather than spawning a Hadoop cluster just for that purpose.
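As a rough sketch of that idea, assuming checkpoints stay plain Parquet writes and using a made-up bucket name and prefix layout, checkpoint data could go under a disposable, per-run prefix:

import uuid
from pathlib import Path

import pyspark
import pyspark.sql

# Sketch only: bucket name, prefix layout, and step name are hypothetical.
conf = pyspark.conf.SparkConf().setMaster("local").setAppName("s3a-checkpoint-sketch")
jars = Path(".").resolve() / "target" / "dependency" / "*"
conf.set("spark.driver.extraClassPath", str(jars))
conf.set("spark.executor.extraClassPath", str(jars))

sc = pyspark.context.SparkContext(conf=conf)
sql = pyspark.sql.SQLContext(sc)

# Disposable, per-run prefix for checkpoint data.
checkpoint_base = f"s3a://functionalizer-scratch/{uuid.uuid4()}"

df = sql.range(10)  # stand-in for an intermediate Functionalizer DataFrame

# Write the intermediate result as Parquet to S3 ...
df.write.parquet(f"{checkpoint_base}/step_0.parquet")
# ... and read it back later instead of recomputing it.
restored = sql.read.parquet(f"{checkpoint_base}/step_0.parquet")
restored.show()

The prefix (and bucket, if created per run) could then simply be deleted once the run finishes.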