b23g5r42i opened this issue 1 month ago
In file-like mode (using open()), you are limited by the blocking nature of the API: s3fs uploads one block_size chunk at a time, serially. This is configurable via block_size=, and the default is only 5MB, optimised for minimum memory use. We have discussed greatly increasing the default, but you can do this yourself in the call.
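For example, a minimal sketch of passing a larger block size in the call; the bucket, key, local file name and the 50MB figure here are only placeholders:

import s3fs

fs = s3fs.S3FileSystem()  # credentials resolved as usual (env vars, config files, IAM role)
# request 50MB parts instead of the 5MB default: more memory per buffer, fewer part uploads
with fs.open("my-bucket/large-object.bin", "wb", block_size=50 * 2**20) as f:
    f.write(open("local-file.bin", "rb").read())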
The non-file-like, upload-from-disk method is
fs.put_file(file_path)
which already uses a larger 50MB block (called chunksize in the call). It also makes the part-upload calls concurrently. This is what you want.
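A sketch of that route, again with placeholder paths and the chunksize keyword mentioned above:

import s3fs

fs = s3fs.S3FileSystem()
# upload straight from disk; the file is split into chunksize-byte parts
fs.put_file("local-file.bin", "my-bucket/large-object.bin", chunksize=50 * 2**20)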
@martindurant Thanks for the reply! Increasing the block_size can indeed boost speed by up to 50% in my tests, but it's still about twice as slow as boto3, which I believe benefits from better concurrency handling.
The put_file() method looks useful, but it seems to work only with file paths. Is there an fsspec-based variant that could, for example, support pickle.dump() directly? My main goal is to unify I/O operations with upath + fsspec, including S3.
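For context, this is roughly the unified route I have in mind (a sketch with a placeholder S3 URL, assuming UPath.open() dispatches to s3fs through fsspec):

import pickle
from upath import UPath

path = UPath("s3://my-bucket/model.pkl")    # placeholder S3 URL
with path.open("wb") as f:                  # resolved to s3fs via fsspec
    pickle.dump({"weights": [1, 2, 3]}, f)  # any picklable object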
Would you care to test with #901?
It's worth pointing out that s3transfer, and maybe boto3 itself, use threads and/or processes for parallelism, which matters in low-latency situations where the CPU time for stream compression might be significant. s3fs is single-threaded.
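For comparison, a rough sketch of what the boto3 side does when you hand it a file; the TransferConfig values and names are just examples:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
# s3transfer splits the file into parts and uploads them from a thread pool
config = TransferConfig(multipart_chunksize=50 * 2**20, max_concurrency=10, use_threads=True)
s3.upload_file("local-file.bin", "my-bucket", "large-object.bin", Config=config)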
"it seems to work only with file paths"

I was trying to match your code, which writes the whole contents of a file. To push bytes from memory in one go:
fs.pipe(path, value)
(where value could be pickle.dumps(...)).
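Put together, a sketch (the key name and the pickled object are placeholders):

import pickle
import s3fs

fs = s3fs.S3FileSystem()
value = pickle.dumps({"weights": [1, 2, 3]})  # serialize in memory first
fs.pipe("my-bucket/model.pkl", value)         # upload the whole bytes object in one call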
I also started #901 for you to speed test.
Hi, with pickle.dumps() it goes through fsspec's write() interface, so I think we only need to change that line to enable the new block size. Meanwhile, in my testing I find that max_concurrency has no impact on performance...
Here is my testing code:
# fs is an s3fs.S3FileSystem instance; BLOCK_SIZE and CONCURRENCY are the values under test
with fs.open(f'{bucket_name}/{key}', 'wb', block_size=BLOCK_SIZE, max_concurrency=CONCURRENCY) as f:
    f.write(open(file_path, 'rb').read())
With a 1GB object, the default 5MB block size gives me 40MB/s, 50MB gives 70MB/s, and 500MB gives 90MB/s. But boto3 gives 160MB/s. Changing max_concurrency from 1 to 10 has no impact on the speed at all.
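For reference, here is a sketch of the timing harness behind those numbers (the local file, bucket and key names are placeholders):

import time
import s3fs

fs = s3fs.S3FileSystem()
BLOCK_SIZE = 50 * 2**20                       # 50MB parts, as in the run above
file_path = "local-1GB-file.bin"              # placeholder 1GB test file
bucket_name, key = "my-bucket", "speed-test"  # placeholder destination

data = open(file_path, "rb").read()
start = time.perf_counter()
with fs.open(f"{bucket_name}/{key}", "wb", block_size=BLOCK_SIZE) as f:
    f.write(data)
elapsed = time.perf_counter() - start
print(f"{len(data) / elapsed / 2**20:.1f} MB/s")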
I've noticed that when uploading large files (greater than 1GB), s3fs.write() is around three times slower than the boto3.upload_file() API.
Is this slower performance expected when using s3fs, and are there any configurations or optimizations that could improve its upload speed?