S3 `Downloader` potentially downloads corrupted object when the it is split into multiple parts

kevinjqiu commented 1 year ago

Describe the bug

We have a system that's constantly updating an object stored in an S3 bucket
The object is downloaded by multiple clients concurrently
In our particular case, the stored object is gzipped but the issue applies to all sorts of files
When the object grows to over 5MB (which causes the downloader to download the object in multiple parts), we encountered an issue where the second part is downloaded from a different version of the object, therefore corrupting the download

Expected Behavior

The latest version of the object is downloaded

Current Behavior

An error is encountered from time to time, e.g.,

2023/09/12 14:54:39
panic: flate: corrupt input before offset 1089400

goroutine 1 [running]:
main.fetch(0x14000146640)
        /Users/kevinqiu/src/tmp/s3bugrepro/main.go:36 +0x2a0
main.main()
        /Users/kevinqiu/src/tmp/s3bugrepro/main.go:57 +0xd0
exit status 2

or

panic: gzip: invalid checksum

With logging turned on, it's observed that when a later part is being downloaded and when the object is updated before that, the later chunk from a different version is downloaded and therefore corrupting the output.

Reproduction Steps

Minimally reproducible example:

Producer

The producer is simply a script that uploads a gzipped file (greater than 5MB) to a bucket constantly

#! /bin/bash
while true; do
    files="0 1 2 3"
    for i in $files; do
        aws s3 cp $i s3://$BUCKET/TEST
        sleep 1
    done
done

Consumer

package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "io"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
    "github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func fetch(downloader *s3manager.Downloader) {
    buff := &aws.WriteAtBuffer{}
    goi := &s3.GetObjectInput{
        Bucket: aws.String("BUCKET"),  // replace with the real bucket name
        Key:    aws.String("TEST"),
    }
    _, err := downloader.Download(buff, goi)
    if err != nil {
        panic(err)
    }

    b := buff.Bytes()
    br := bytes.NewReader(b)
    gzr, err := gzip.NewReader(br)
    if err != nil {
        panic(err)
    }

    uzb, err := io.ReadAll(gzr)
    if err != nil && err != io.EOF {
        panic(err)
    }

    fmt.Printf("Size=%v\n", len(uzb))
}

func main() {
    s, err := session.NewSession(&aws.Config{
        Region:   aws.String("us-east-1"),
        LogLevel: aws.LogLevel(aws.LogDebugWithHTTPBody),
    })

    if err != nil {
        panic(err)
    }

    downloader := s3manager.NewDownloader(s, func(downloader *s3manager.Downloader) {
        downloader.PartSize = 1024 * 512  // set to a small chunk size so the problem can be reproduced sooner
    })

    for {
        fetch(downloader)
        time.Sleep(2 * time.Second)
    }
}

Possible Solution

When GetObjectInput.versionId is not provided by the user (which means getting the latest object version), send a request to first figure out the latest version of the object, and then set the versionId in the subsequent downloadChunk method.

Additional Information/Context

No response

SDK version used

1.44

Environment details (Version of Go (`go version`)? OS name and version, etc.)

1.20

RanVaknin commented 1 year ago

Hi @kevinjqiu,

Thanks for reaching out. What you are describing is not a bug with the SDK, but rather a data intensive application concept called "read skew" and one solution for that is exactly what you described here:

When GetObjectInput.versionId is not provided by the user (which means getting the latest object version), send a request to first figure out the latest version of the object, and then set the versionId in the subsequent downloadChunk method.

The SDK team cannot cover this use case in our current implementation mainly because adding an extra API call would come with an additional cost both in terms of $ value, and performance hit, and would not accomodate the majority of customers. Not to mention it might not be backwards compatible.

The solution to this is implementing this mechanism on the application level. You need to implement you own version checking ("optimistic locking") mechanism to make sure you have the appropriate version before you run GetObject. The GetObjectInput interface takes in a VersionId that makes this easier:

latestObjectVersion := getLatestVersion("BUCKET", "TEST") // you need to implement this function.

goi := &s3.GetObjectInput{
    Bucket:    aws.String("BUCKET"),
    Key:       aws.String("TEST"),
    VersionId: aws.String(latestObjectVersion),
}

I'm not sure what is the business case for your application, but if the above approach proves to be too difficult, you might want to look into event driven approach with SQS for example. instead of consumers independently polling S3 for the latest version of the object, they would be notified when a new version is available and then proceed to download it.

I understand that this must be frustrating to read, but as it is, its not really actionable by the SDK team so Im inclined to close this. If you have any additional questions, please consider opening a discussion in the discussion tab on our repo.

Thanks again, Ran~

github-actions[bot] commented 1 year ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

kevinjqiu commented 1 year ago

@RanVaknin Thanks for the reply. I think at the very least, the risk of read skew should be called out in the documentation.

From the user's perspective, we call downloader.Download() to get the latest version of the object. We don't really know (or should we know) that the object is split and stored in multiple chunks and there's a possibility where different version of the chunks are downloaded if the underlying object has been updated in the process. The API should abstract this away but I understand the concerns you laid out, so I think at least the documentation should be updated to let the user know the risk exists and consider the mitigation on the client side. In our case, we found out the hard way.

RanVaknin commented 1 year ago

Hi @kevinjqiu

Thanks for the follow up.

We don't really know (or should we know) that the object is split and stored in multiple chunks

You are using the Downloader package which is the multipart download utility for the SDK. If you do not need to download the object in parts you can use S3.GetObject directly.

In terms of the documenting this, this is not an edge case unique to the SDK. Read Skews can happen in many many frameworks, and since you are designing a data intensive application, you need to be aware of the various strategies to mitigate their inherent caveats and risks. The same way we don't document that you could run into issues with idempotent writes when interacting with an RDS table, we don't document general programming best practices. These are concepts that apply at an infrastructural level and are not unique to client-side tools like the SDK

I appreciate your input and understand the concerns you've raised. Our goal is to keep the SDK documentation focused on its core functionality while providing clear guidance. If there are specific gaps or ambiguities you've noticed in relation to the SDK itself, feel free to open a separate documentation issue with actionable feedback.

Thanks again, Ran~

rittneje commented 1 year ago

@RanVaknin To be frank, I find it wholly unacceptable that AWS's stance here is to do absolutely nothing to help its customers. Unless you explicitly happen to test this race condition, and get the timing right, you won't notice until you already have an issue in production.

At the very least, the SDK MUST return an error if it detects that two chunks came from different versions during a multipart download.

aws / aws-sdk-go