gaul / s3proxy

Access other storage backends via the S3 API
Apache License 2.0

CompleteMultipartUpload is slow for large objects on filesystem storage backend #471

Open Jayd603 opened 1 year ago

Jayd603 commented 1 year ago

I have s3proxy on an Ubuntu server writing to a network share on the local file system. With very large uploads (150 GB+), the uploads seem to be completing OK, but the AWS client says it timed out. It looks like 16 MB multipart chunks are being created, and that is where the IO delay is happening. I'm assuming an ack to the client isn't sent until after the MPUs are combined? Just a guess, but before I start digging into things, are there any existing settings to control multipart behavior in s3proxy?

Update: --cli-read-timeout 0 worked (no more timeout), so I made that 7200.

Update: I can't seem to change the multipart chunk size. How is s3proxy handling MPUs? I tried to set higher values on the AWS client, but it chokes.
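For reference, the client-side knobs referred to above are normally set like this (AWS CLI syntax; the 64 MB value is an illustrative assumption, and these tune the client's upload behavior, not s3proxy itself):

```shell
# Client-side multipart tuning for the AWS CLI (does not change s3proxy behavior).
aws configure set default.s3.multipart_chunksize 64MB    # larger parts => fewer part files on the backend
aws configure set default.s3.max_concurrent_requests 4   # fewer parallel writes to the network share

# Per-command read timeout, as used above (0 = no timeout).
aws s3 cp bigfile s3://mybucket/ --cli-read-timeout 7200
```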

gaul commented 1 year ago

This poor performance is a compromise since S3Proxy attempts to store objects in the native file format. Each multipart upload creates a temporary object per part. When the client calls CompleteMultipartUpload, S3Proxy reads all the individual parts and writes them into the final combined file. Thus S3Proxy does 3x IO (write, read, write). This is different from native object stores, which keep the parts separate and only do the initial write.
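The concatenation step described above can be sketched like this (an illustrative outline, not S3Proxy's actual code; class and method names are made up):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Sketch of why CompleteMultipartUpload costs 3x IO on a filesystem backend:
// each part was already written once during upload, and completing the upload
// reads every part and writes its bytes again into the final object file.
public class CompleteMpuSketch {
    // Concatenate the temporary part files into the final object file,
    // removing each part once it has been copied.
    static void completeMultipartUpload(List<Path> parts, Path finalObject)
            throws IOException {
        try (OutputStream out = Files.newOutputStream(finalObject,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (Path part : parts) {
                Files.copy(part, out);  // second IO pass: read part, write combined file
                Files.delete(part);     // temporary part no longer needed
            }
        }
    }
}
```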

On a specific filesystem like XFS, perhaps S3Proxy could use reflinks to reduce the copy cost, although Java lacks access to these APIs. Alternatively, S3Proxy could add a mode that keeps the object parts in their original format. Both of these approaches seem difficult. Maybe there is some more performant way to do the simple copy, e.g., fallocate or something?

gaul commented 11 months ago

Java 20 adds support for copy_file_range:

https://bugs.openjdk.org/browse/JDK-8264744

I believe this can trigger the reflink mechanisms in btrfs and XFS.
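A sketch of how that could look (an assumption about a possible approach, not committed S3Proxy code): concatenate parts via FileChannel.transferTo, which on Java 20+ on Linux may be implemented with copy_file_range (JDK-8264744), letting btrfs and XFS satisfy the copy with a reflink instead of moving the bytes through userspace.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical part-concatenation using channel-to-channel transfer.
// On older JDKs this still works, just via an ordinary kernel copy path.
public class ReflinkConcatSketch {
    static void concatParts(List<Path> parts, Path finalObject) throws IOException {
        try (FileChannel out = FileChannel.open(finalObject,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (Path part : parts) {
                try (FileChannel in = FileChannel.open(part, StandardOpenOption.READ)) {
                    long size = in.size();
                    long transferred = 0;
                    // transferTo may move fewer bytes than requested; loop until done
                    while (transferred < size) {
                        transferred += in.transferTo(transferred, size - transferred, out);
                    }
                }
            }
        }
    }
}
```

Whether copy_file_range is actually used depends on the JDK version, OS, and whether both files live on the same filesystem, so the reflink savings would need to be measured rather than assumed.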