Benchmarking on s3

I'm trying to benchmark spark-fits on s3 by running the same piece of code in a loop:

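import os
import time

path = "s3a://abucket/..."
fn = "afile.fits" # 700 MB

# N is the number of benchmark iterations; `spark` is an existing SparkSession.
for index in range(N):
  # Re-create the DataFrame on every iteration.
  df = spark.read\
    .format("fits")\
    .option("hdu", 1)\
    .load(os.path.join(path, fn))

  # Spark is lazy: count() is the action that triggers the read from S3.
  start = time.time()
  df.count()
  elapsed = time.time() - start
  print("{} seconds".format(elapsed))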
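
For comparing runs, it may help to collect the per-iteration timings rather than only printing them. The following is a minimal sketch, not part of spark-fits: `time_repeated` and `summarize` are hypothetical helper names, and `action` stands for any zero-argument callable.

```python
import statistics
import time


def time_repeated(action, n_iter):
    """Call `action` n_iter times and return the elapsed seconds of each call."""
    timings = []
    for _ in range(n_iter):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Report the first iteration separately, since it often behaves differently."""
    return {
        "first": timings[0],
        "min": min(timings),
        "median": statistics.median(timings),
    }
```

With the loop above, this could be called as `summarize(time_repeated(lambda: df.count(), N))`. Note that this reuses one `df`, whereas the original loop rebuilds it on every iteration.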

With the default s3 configuration, the job hangs after the first iteration and then fails with a timeout error. I found that increasing fs.s3a.connection.maximum, the parameter that controls the maximum number of simultaneous connections to S3, from 15 to 100 makes the problem go away. It is not clear why this helps, so it would be good to investigate further.
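
For reference, here is one way the workaround could be applied from PySpark. This is a sketch under assumptions: it relies on Spark forwarding `spark.hadoop.*` properties to the Hadoop configuration, and `s3a_conf`, `build_session` and the application name are hypothetical names chosen for illustration.

```python
def s3a_conf(max_connections=100):
    """Return Spark conf entries that raise the S3A connection pool size."""
    # The "spark.hadoop." prefix makes Spark pass the property on to Hadoop.
    return {"spark.hadoop.fs.s3a.connection.maximum": str(max_connections)}


def build_session(conf, app_name="spark-fits-s3-benchmark"):
    """Create (or get) a SparkSession with the given conf entries applied."""
    # Imported here so the helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Usage would be `spark = build_session(s3a_conf(100))`. If a session already exists, `getOrCreate()` returns it and the new setting may not take effect, so in that case the option is probably better passed at launch time with `--conf`.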

astrolabsoftware / spark-fits

Benchmarking on s3 #67