I'm trying to benchmark spark-fits on S3 by internally looping over the same piece of code:
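Here is a minimal sketch of the kind of loop I mean (the bucket name, file path, HDU index, and iteration count below are placeholders, not the actual benchmark values):

```scala
import org.apache.spark.sql.SparkSession

object S3Benchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-fits S3 benchmark")
      .getOrCreate()

    // Hypothetical S3 location of the FITS file
    val path = "s3a://my-bucket/data.fits"

    for (i <- 1 to 10) {
      val t0 = System.nanoTime()
      // Read one HDU with spark-fits and force an action
      // so the data is actually pulled from S3
      val df = spark.read
        .format("fits")
        .option("hdu", 1)
        .load(path)
      val n = df.count()
      val elapsed = (System.nanoTime() - t0) / 1e9
      println(s"Iteration $i: $n rows in $elapsed s")
    }

    spark.stop()
  }
}
```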
With the default S3 configuration, it hangs after the first iteration and I get a timeout error. I found that increasing the parameter that controls the maximum number of simultaneous connections to S3 (fs.s3a.connection.maximum) from 15 to 100 somehow fixes the problem. It is not clear exactly why or how, so it would be good to investigate further.
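For the record, one way to apply the workaround is to pass the setting through the Spark configuration, since Hadoop properties prefixed with `spark.hadoop.` are forwarded to the underlying Hadoop Configuration used by the s3a filesystem (sketch below, assuming the session is built in the application itself):

```scala
import org.apache.spark.sql.SparkSession

// Raise the S3A connection pool size from the default 15 to 100
val spark = SparkSession.builder()
  .appName("spark-fits S3 benchmark")
  .config("spark.hadoop.fs.s3a.connection.maximum", "100")
  .getOrCreate()
```

The same property can be set on the command line with `spark-submit --conf spark.hadoop.fs.s3a.connection.maximum=100` or in `core-site.xml`.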