astrolabsoftware / spark-fits

FITS data source for Spark SQL and DataFrames
https://astrolabsoftware.github.io/spark-fits/
Apache License 2.0

Benchmarking on s3 #67

Open JulienPeloton opened 5 years ago

JulienPeloton commented 5 years ago

I'm trying to benchmark spark-fits on S3 by running the same piece of code repeatedly in a loop:

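import os
import time

# `spark` is an existing SparkSession with spark-fits available;
# N is the number of benchmark iterations.
path = "s3a://abucket/..."
fn = "afile.fits" # 700 MB

for index in range(N):
  df = spark.read\
    .format("fits")\
    .option("hdu", 1)\
    .load(os.path.join(path, fn))

  # Time a full pass over the HDU (count() forces the read)
  start = time.time()
  df.count()
  elapsed = time.time() - start
  print("{} seconds".format(elapsed))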

With the default S3 configuration, it hangs after the first iteration and I get a timeout error. I found that increasing fs.s3a.connection.maximum, the parameter that controls the maximum number of simultaneous connections to S3, from 15 to 100 somehow fixes the problem. It is not clear why or how, so it would be good to investigate further.
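
For reference, here is a minimal sketch of one way to apply that setting from PySpark. It relies on Spark's `spark.hadoop.` prefix, which forwards a key to the Hadoop configuration used by S3A. The helper names (`s3a_connection_options`, `build_session`) and the application name are hypothetical and only serve the illustration; this is not necessarily how the benchmark was configured.

```python
def s3a_connection_options(max_connections=100):
    """Return Spark options that raise the S3A connection pool size.

    Hypothetical helper: the "spark.hadoop." prefix asks Spark to copy
    the key into the Hadoop configuration read by the S3A filesystem.
    """
    if max_connections < 1:
        raise ValueError("max_connections must be positive")
    return {"spark.hadoop.fs.s3a.connection.maximum": str(max_connections)}


def build_session(app_name="spark-fits-s3-benchmark", max_connections=100):
    """Create a SparkSession with the larger connection pool (requires pyspark)."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3a_connection_options(max_connections).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `spark = build_session()` before the loop would then give the benchmark a pool of 100 connections. The option most likely needs to be set before the Spark context is created; if a session already exists, `getOrCreate()` returns it and the Hadoop setting may not be picked up.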