fsspec / s3fs

S3 Filesystem
http://s3fs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Copy operation of multiple files (around 450MB each) fails with #897

Closed. manugarri closed this issue 1 month ago

manugarri commented 1 month ago

I am trying to copy all the parquet files (1001 files, around 450MB each) inside an s3 folder into another s3 folder.

Regarding versions, I'm using Python 3.10 and have tried both s3fs = "^2023.4.0" and s3fs = "2024.9.0". The code I used is the following:

import s3fs
s3_fs = s3fs.S3FileSystem()
# in_path and out_path are the source and destination S3 prefixes
s3_fs.cp(in_path, out_path, recursive=True)

This command fails after 16 minutes 30 seconds with the following error (with debug logging enabled). Each retry fails after a similar amount of time:

Client error (maybe retryable): An error occurred (RequestTimeTooSkewed) when calling the CopyObject operation: The difference between the request time and the current time is too large.

I synced my machine's clock via NTP with the AWS time servers and am still getting the same issue.

I found an issue from 2016 in a separate repo where the user experiences the same error and a nearly identical timeout (16m34s); it can't be a coincidence.

Using the AWS CLI sync command works with no issue.

martindurant commented 1 month ago

This probably means that too many copy operations are requested at the same time, so some don't even start until long after they were issued. It is possible to limit the number of operations happening at the same time with the argument batch_size=, which is a big number by default (effectively all the files at once).
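
For example, a minimal sketch (the paths are the same placeholders as in your snippet, and the batch size shown is just an arbitrary starting point):

import s3fs
s3_fs = s3fs.S3FileSystem()
# limit how many copy operations are submitted concurrently
s3_fs.cp(in_path, out_path, recursive=True, batch_size=32)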

manugarri commented 1 month ago

@martindurant thanks for the suggestion. I see in the fsspec spec that the default is 128 files at a time. I'm going to try reducing the batch size and see if that does the trick. Will report back!

manugarri commented 1 month ago

@martindurant that worked! I set batch_size to 100 and the job ran successfully. I wonder if there is somewhere we can document this caveat (copying a large number of big files will error out with the defaults).

martindurant commented 1 month ago

Where to write about this is a good question. The functionality exists in fsspec.asyn for any asynchronous backend, and the "best" value of the batch size will depend a lot on connectivity to the remote store in question and the size of the files.

Perhaps a good step would be to wrap the very specific RequestTimeTooSkewed error (here) and suggest varying the batch_size parameter in the message returned. Currently it is converted to a PermissionError.
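
In the meantime, a user-side workaround is to catch the error and retry with a smaller batch. A minimal sketch, assuming the skew error still surfaces as PermissionError (the helper name and retry schedule are made up for illustration):

import s3fs

def copy_with_smaller_batches(fs, src, dst, batch_sizes=(128, 32, 8)):
    # Try the recursive copy with progressively smaller batch_size values.
    last_err = None
    for bs in batch_sizes:
        try:
            fs.cp(src, dst, recursive=True, batch_size=bs)
            return
        except PermissionError as exc:  # RequestTimeTooSkewed surfaces here today
            last_err = exc
    raise last_err

copy_with_smaller_batches(s3fs.S3FileSystem(), in_path, out_path)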