axman6 / amazonka-s3-streaming

Provides a conduit-based interface for uploading data to S3 using the Multipart API
MIT License

S3 Multipart max 1024 chunks #6

Closed · tolysz closed this 7 years ago

tolysz commented 7 years ago

S3 has a maximum of 1024 chunks per multipart upload, so uploading, say, an 80 GiB file would need 16k+ chunks at the current chunk size.

There should be a more robust approach, e.g.:

chunkSize = max minS3ChunkSize (ceiling (fileSize / 1024))

There should also be a cleverer way of sending the data, since the maximum size for S3 objects is 5 TiB and the current implementation allocates every chunk in memory.
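
A minimal Haskell sketch of the suggested calculation, with invented names (minS3ChunkSize, maxParts, chunkSizeFor are not from this package) and integer ceiling division in place of ceiling (fileSize / 1024); note the part limit turns out to be 10,000, per the correction later in this thread:

-- Sketch only; these names are illustrative, not from amazonka-s3-streaming.
minS3ChunkSize :: Integer
minS3ChunkSize = 5 * 1024 * 1024   -- S3's minimum part size

maxParts :: Integer
maxParts = 10000                   -- 1024 in the report above; corrected to 10,000 below

-- Smallest chunk size that keeps the whole file within maxParts parts,
-- using integer ceiling division instead of ceiling (fileSize / maxParts).
chunkSizeFor :: Integer -> Integer
chunkSizeFor fileSize =
  max minS3ChunkSize ((fileSize + maxParts - 1) `div` maxParts)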

axman6 commented 7 years ago

Hey @tolysz, thanks for the report; I must've missed that. The fix should be pretty easy, so I'll look into it this weekend. I might need to change the streaming API somewhat though, to allow the user to set a chunk size. Part of the point of this package was for times when you don't know the size of the data ahead of time, so making sure chunks are big enough will be tough.
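
For what it's worth, regrouping the stream to a caller-supplied chunk size is straightforward with conduit; the hard part is only picking the size without knowing the total length. A sketch, assuming the Conduit module from the conduit package (uploadChunks and uploadOne are hypothetical names, not this package's API):

import Conduit
import Data.ByteString (ByteString)

-- Regroup an unknown-length byte stream into fixed-size chunks and hand
-- each chunk to the per-part upload action. The chunk size still has to
-- be chosen up front, since the total length isn't known in advance.
uploadChunks :: Int -> ConduitT () ByteString (ResourceT IO) () -> (ByteString -> IO ()) -> IO ()
uploadChunks chunkSize source uploadOne =
  runConduitRes $ source .| chunksOfE chunkSize .| mapM_C (liftIO . uploadOne)

The source could be anything, e.g. sourceFile path or stdinC.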

tolysz commented 7 years ago

Regarding mmap, we could use chunkedFileOffsetLength from https://github.com/brendanhay/amazonka/pull/359 once it is accepted.

forM params $ \(partNum, off, size) -> do
  -- read each part's byte range from the file and upload it as its own part
  bdy <- chunkedFileOffsetLength AWS.defaultChunkSize fp (fromIntegral off) (fromIntegral size)
  send $ uploadPart bucket key partNum upId bdy

Not sure about forM vs forConcurrently; we really need a pool of workers rather than 1 or 1024...
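
A semaphore-bounded forConcurrently would give that pool; a sketch using the async and base packages (pooledForN is an invented name):

import Control.Concurrent.Async (forConcurrently)
import Control.Concurrent.QSem (newQSem, signalQSem, waitQSem)
import Control.Exception (bracket_)

-- Run the action on every element, but with at most n in flight at a time:
-- neither fully sequential (forM) nor one thread per part (forConcurrently).
pooledForN :: Int -> [a] -> (a -> IO b) -> IO [b]
pooledForN n xs action = do
  sem <- newQSem n
  forConcurrently xs $ \x ->
    bracket_ (waitQSem sem) (signalQSem sem) (action x)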

tolysz commented 7 years ago

I did more research: the max is 10,000 parts.

Item                                  Specification
Maximum object size                   5 TB
Maximum number of parts per upload    10,000
Part numbers                          1 to 10,000 (inclusive)
Part size                             5 MB to 5 GB; the last part can be < 5 MB

Source: http://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html. Maybe it is a moving target.
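
Taking those limits at face value, the smallest usable part size is the object size divided by 10,000, floored at the 5 MB minimum; a quick GHCi check:

λ> let minPart size = max (5 * 10^6) ((size + 9999) `div` 10000)
λ> minPart (5 * 10^12)    -- 5 TB, the table's maximum object size
500000000
λ> minPart (80 * 1024^3)  -- the 80 GiB file from the original report
8589935

So even at the 5 TB ceiling, a fixed 500 MB part size stays within the cap, and the 80 GiB case needs only ~8.6 MB parts.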

axman6 commented 7 years ago

So, looking at chunkedFileOffsetLength from brendanhay/amazonka#359: I've tried to avoid reading the contents of the file more than once, hence the use of mmap and computing the hash from the data, which needs to be in memory anyway. I'll still need to sort out the 10,000-part limit, but I think I'll stick with the mmap approach.
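
A sketch of that idea using the mmap and cryptonite packages (hashPart is an invented name, and SHA256 is an assumption for illustration; the digest this package actually needs may differ):

import Data.ByteString (ByteString)
import Data.Int (Int64)
import System.IO.MMap (mmapFileByteString)
import Crypto.Hash (Digest, SHA256, hash)

-- Map one part's byte range into memory so the same pages back both the
-- digest computation and the subsequent network send, without reading the
-- file a second time or copying its contents.
hashPart :: FilePath -> Int64 -> Int -> IO (ByteString, Digest SHA256)
hashPart fp off len = do
  bytes <- mmapFileByteString fp (Just (off, len))
  pure (bytes, hash bytes)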