boto / s3transfer

Amazon S3 Transfer Manager for Python
Apache License 2.0

Support multipart downloads when downloading large ranges via TransferManager.download() #248

Open forrestfwilliams opened 1 year ago

forrestfwilliams commented 1 year ago

This issue references issue #1215 and its duplicate, #3466, from the boto3 repository. It has also been discussed in this Stack Overflow post.

Issue

s3transfer supports both ranged download requests and multipart downloads; however, it is not possible to perform a multipart download over a specific byte range. As a result, downloading a 1GB range of data from a 4GB file in S3 is slow.

Use Case

I work at the Alaska Satellite Facility, where we distribute large amounts of remote sensing data to users across the globe via AWS. Many of these datasets come in legacy formats, such as zip files, that are not cloud-friendly. Because these datasets are highly structured, we can identify byte ranges containing the subsets of data that our users would want to download directly. However, since these subsets are still large (~1GB within a larger 4GB zip file) and multipart downloads are not supported for range requests, we cannot offer low-latency extraction of these datasets. I know of many other groups that have encountered this issue while trying to distribute large remote sensing datasets.

Proposed Solution

It would be great if a range argument were added to TransferConfig and passed through to the TransferManager.download() call, so that any requested range larger than multipart_threshold is downloaded via a multipart download.
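To illustrate the behavior being requested, here is a minimal sketch of how a byte range could be split into parts and fetched with concurrent ranged GET requests. This is not s3transfer's implementation and not a proposed API: split_range and download_range are hypothetical helper names, and the client argument is only assumed to behave like a boto3 S3 client, whose get_object call accepts a Range header of the form "bytes=first-last". The 8 MiB part size and 10 workers are chosen to mirror the TransferConfig defaults.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start, end, part_size):
    """Split the inclusive byte range [start, end] into inclusive (first, last) parts."""
    if start < 0 or end < start or part_size <= 0:
        raise ValueError("expected 0 <= start <= end and part_size > 0")
    return [
        (first, min(first + part_size - 1, end))
        for first in range(start, end + 1, part_size)
    ]


def download_range(client, bucket, key, start, end,
                   part_size=8 * 1024 * 1024, max_workers=10):
    """Fetch bytes start..end (inclusive) of an object with concurrent ranged GETs.

    Assumption: `client` exposes get_object(Bucket=..., Key=..., Range=...)
    like a boto3 S3 client and returns a dict whose "Body" has a read() method.
    """
    def fetch(part):
        first, last = part
        response = client.get_object(
            Bucket=bucket, Key=key, Range=f"bytes={first}-{last}"
        )
        return response["Body"].read()

    parts = split_range(start, end, part_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so the parts join correctly.
        return b"".join(pool.map(fetch, parts))
```

With a real client this might be called as download_range(boto3.client("s3"), "my-bucket", "archive.zip", 0, 1024**3 - 1), where the bucket and key are placeholders. The sketch holds the whole range in memory and has no retries or progress callbacks, which is part of why built-in support in TransferManager.download() would be preferable.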

I am willing to participate in developing this solution.

forrestfwilliams commented 1 year ago

@tim-finnigan is there any update on this work? Excited to see #260!

tim-finnigan commented 1 year ago

I don't have any updates at the moment but will check in with the team.

forrestfwilliams commented 1 year ago

@tim-finnigan just checking in again. Did you hear back from the team?

tim-finnigan commented 1 year ago

Hi @forrestfwilliams, thanks for your patience, and apologies for the delay in getting back to you. This issue was reviewed in the last couple of weeks, and it was determined that it will need further investigation at a cross-SDK level. I think there are some planned improvements related to S3 transfers that may or may not overlap with this issue. I wish I had more details to share, but unfortunately that is the extent of what I know at this time. I still plan to update this issue when there is more information to share.

forrestfwilliams commented 1 year ago

Hey @tim-finnigan, any updates on the "planned improvements related to S3 transfers" that overlap with this issue? Thanks!

tim-finnigan commented 1 year ago

Hi @forrestfwilliams, thanks for following up. This feature request is still in process but moving forward. It is part of a broader effort to improve S3 transfers across SDKs, and a thorough review process is required before the logic can be updated.