aws / aws-sdk-go-v2

AWS SDK for the Go programming language.
https://aws.github.io/aws-sdk-go-v2/docs/
Apache License 2.0

Support for io.Reader Interface in S3 Transfer Manager's Downloader #2247

Open yacchi opened 10 months ago

yacchi commented 10 months ago

Describe the feature

Currently, the Download function of the Transfer Manager's Downloader accepts an io.WriterAt. As a result, after writing to a file or buffer, the caller still has to create an io.Reader over that data.

When working with files in Go, an io.Reader is commonly required. I believe that if the Downloader could produce an io.Reader directly, it would significantly improve usability.

Use Case

Proposed Solution

The behavior of the AWS CLI's cp command closely aligns with my expectations. For instance, it can be used as follows:

aws s3 cp s3://BUCKET/key.tar.gz - | tar zxf -

Internally, it appears to use a heap to output chunks sequentially from the beginning. I've written code in my repository that works in a similar way on top of the current Downloader.

https://github.com/yacchi/s3-fast-reader
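
The heap-based reordering described above can be sketched roughly as follows. This is a minimal illustration using only the standard library; it is not the code from the linked repository, nor the AWS CLI's implementation. The names `chunk`, `chunkHeap`, `orderedWriterAt` and `reorder` are hypothetical. The sketch assumes chunks never overlap, and it keeps every early chunk in memory with no upper bound.

```go
package main

import (
	"bytes"
	"container/heap"
	"fmt"
	"io"
	"sync"
)

// chunk is one write that arrived before its turn.
type chunk struct {
	off  int64
	data []byte
}

// chunkHeap is a min-heap of chunks ordered by offset.
type chunkHeap []chunk

func (h chunkHeap) Len() int           { return len(h) }
func (h chunkHeap) Less(i, j int) bool { return h[i].off < h[j].off }
func (h chunkHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *chunkHeap) Push(x any)        { *h = append(*h, x.(chunk)) }
func (h *chunkHeap) Pop() any {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// orderedWriterAt (hypothetical) accepts WriteAt calls in any order and
// forwards the bytes to w strictly in offset order.
type orderedWriterAt struct {
	mu      sync.Mutex
	w       io.Writer
	next    int64     // offset of the next byte w expects
	pending chunkHeap // chunks that arrived early
}

func (o *orderedWriterAt) WriteAt(p []byte, off int64) (int, error) {
	o.mu.Lock()
	defer o.mu.Unlock()

	// Copy p, since the caller may reuse the slice after WriteAt returns.
	heap.Push(&o.pending, chunk{off: off, data: append([]byte(nil), p...)})

	// Flush every chunk that is now contiguous with what was already written.
	for o.pending.Len() > 0 && o.pending[0].off == o.next {
		c := heap.Pop(&o.pending).(chunk)
		n, err := o.w.Write(c.data)
		o.next += int64(n)
		if err != nil {
			return 0, fmt.Errorf("flush at offset %d: %w", c.off, err)
		}
	}
	return len(p), nil
}

// reorder feeds three chunks out of order and returns what reached the writer.
func reorder() string {
	var out bytes.Buffer
	o := &orderedWriterAt{w: &out}
	o.WriteAt([]byte("world"), 6)
	o.WriteAt([]byte("!"), 11)
	o.WriteAt([]byte("hello "), 0)
	return out.String()
}

func main() {
	fmt.Println(reorder())
}
```

Because `orderedWriterAt` satisfies io.WriterAt, a value like this could in principle be handed to a concurrent downloader, with `w` being the write end of an io.Pipe so that the read end acts as the io.Reader. A real implementation would also need to limit how much data it buffers.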

Other Information

No response

Acknowledgements

AWS Go SDK V2 Module Versions Used

github.com/aws/aws-sdk-go-v2 v1.20.1
github.com/aws/aws-sdk-go-v2/config v1.18.33
github.com/aws/aws-sdk-go-v2/feature/s3/manager v1.11.77
github.com/aws/aws-sdk-go-v2/service/s3 v1.38.2

Go version used

1.20.4

RanVaknin commented 10 months ago

Hi @yacchi,

This seems like a reasonable request. We will likely work on this when we re-implement Downloader.

For now we cannot prioritize this, but I will add this to our backlog.

Thanks! Ran~

lucix-aws commented 8 months ago

To @yacchi or anyone else who may be watching this issue --

I haven't tested this, but I believe you can achieve "sequential" I/O by setting download concurrency to 1, which is spec'd to guarantee sequential in-order multipart downloads. If that's the case your WriterAt effectively becomes safe to write to any sequential I/O implementation.

import (
    "fmt"
    "io"
)

// sequentialWriterAt adapts WriteAt() calls to a sequential I/O implementation
type sequentialWriterAt struct {
    w   io.Writer // or copy to another reader, etc.
    off int64     // offset at which the next write must start
}

func (v *sequentialWriterAt) WriteAt(p []byte, off int64) (int, error) {
    // Reject any write that does not continue where the previous one ended.
    if off != v.off {
        return 0, fmt.Errorf("broken write sequence")
    }

    n, err := v.w.Write(p)
    if err != nil {
        return n, fmt.Errorf("write: %w", err)
    }

    v.off += int64(n)
    return n, nil
}
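
As a further untested sketch, an adapter like this could be paired with io.Pipe to give the caller an io.Reader. The adapter is restated here with an int64 offset so the block compiles on its own. `asReader`, `fakeDownload` and `readAll` are hypothetical names; `fakeDownload` only stands in for a real download call with concurrency set to 1 and does not touch S3.

```go
package main

import (
	"fmt"
	"io"
)

// sequentialWriterAt rejects any write that does not continue exactly where
// the previous one ended, and otherwise forwards it to w.
type sequentialWriterAt struct {
	w   io.Writer
	off int64
}

func (v *sequentialWriterAt) WriteAt(p []byte, off int64) (int, error) {
	if off != v.off {
		return 0, fmt.Errorf("broken write sequence: got offset %d, want %d", off, v.off)
	}
	n, err := v.w.Write(p)
	v.off += int64(n)
	return n, err
}

// asReader (hypothetical helper) runs download in a goroutine and exposes the
// bytes it writes as an io.ReadCloser.
func asReader(download func(w io.WriterAt) error) io.ReadCloser {
	pr, pw := io.Pipe()
	go func() {
		// A nil error makes the reader see io.EOF; a non-nil error is
		// returned from Read instead.
		pw.CloseWithError(download(&sequentialWriterAt{w: pw}))
	}()
	return pr
}

// fakeDownload simulates an in-order, two-part download.
func fakeDownload(w io.WriterAt) error {
	if _, err := w.WriteAt([]byte("part1,"), 0); err != nil {
		return err
	}
	_, err := w.WriteAt([]byte("part2"), 6)
	return err
}

// readAll drains the reader and returns everything the download wrote.
func readAll() (string, error) {
	r := asReader(fakeDownload)
	defer r.Close()
	b, err := io.ReadAll(r)
	return string(b), err
}

func main() {
	s, err := readAll()
	fmt.Println(s, err)
}
```

Closing the returned reader early causes the next pipe write to fail, which gives the download goroutine a way to stop instead of blocking forever.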

This is definitely something we'd like to support for concurrent downloads, though.

yacchi commented 8 months ago

@lucix-aws

Thanks for the helpful code. It looks like I can certainly do sequentialization that way. However, if you don't want concurrency, I think the standard GetObject method is easier because it provides the io.Reader interface.

I implemented the io.Reader interface, which allows concurrency, to solve the following problem.

By using the implemented code, it is possible to download the tar file at an average speed of 800MiB/s or more in an environment with sufficient memory.

jobstoit commented 2 months ago

@lucix-aws @RanVaknin I opened a PR for a concurrent io.Reader/io.WriteCloser for this issue. See #2622

jobstoit commented 2 months ago

@yacchi The PR was rejected because they want to overhaul the manager. Until they have something, you could use my s3io package to read and write S3 objects (concurrently, in chunks, similar to how the manager handles files).