Closed kevinjqiu closed 1 year ago
Hi @kevinjqiu,
Thanks for reaching out. What you are describing is not a bug with the SDK, but rather a data intensive application concept called "read skew" and one solution for that is exactly what you described here:
When GetObjectInput.versionId is not provided by the user (which means getting the latest object version), send a request to first figure out the latest version of the object, and then set the versionId in the subsequent downloadChunk method.
The SDK team cannot cover this use case in our current implementation mainly because adding an extra API call would come with an additional cost both in terms of $ value, and performance hit, and would not accomodate the majority of customers. Not to mention it might not be backwards compatible.
The solution to this is implementing this mechanism on the application level. You need to implement you own version checking ("optimistic locking") mechanism to make sure you have the appropriate version before you run GetObject. The GetObjectInput interface takes in a VersionId
that makes this easier:
latestObjectVersion := getLatestVersion("BUCKET", "TEST") // you need to implement this function.
goi := &s3.GetObjectInput{
Bucket: aws.String("BUCKET"),
Key: aws.String("TEST"),
VersionId: aws.String(latestObjectVersion),
}
I'm not sure what is the business case for your application, but if the above approach proves to be too difficult, you might want to look into event driven approach with SQS for example. instead of consumers independently polling S3 for the latest version of the object, they would be notified when a new version is available and then proceed to download it.
I understand that this must be frustrating to read, but as it is, its not really actionable by the SDK team so Im inclined to close this. If you have any additional questions, please consider opening a discussion in the discussion tab on our repo.
Thanks again, Ran~
Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.
@RanVaknin Thanks for the reply. I think at the very least, the risk of read skew should be called out in the documentation.
From the user's perspective, we call downloader.Download()
to get the latest version of the object. We don't really know (or should we know) that the object is split and stored in multiple chunks and there's a possibility where different version of the chunks are downloaded if the underlying object has been updated in the process. The API should abstract this away but I understand the concerns you laid out, so I think at least the documentation should be updated to let the user know the risk exists and consider the mitigation on the client side. In our case, we found out the hard way.
Hi @kevinjqiu
Thanks for the follow up.
We don't really know (or should we know) that the object is split and stored in multiple chunks
You are using the Downloader package which is the multipart download utility for the SDK. If you do not need to download the object in parts you can use S3.GetObject directly.
In terms of the documenting this, this is not an edge case unique to the SDK. Read Skews can happen in many many frameworks, and since you are designing a data intensive application, you need to be aware of the various strategies to mitigate their inherent caveats and risks. The same way we don't document that you could run into issues with idempotent writes when interacting with an RDS table, we don't document general programming best practices. These are concepts that apply at an infrastructural level and are not unique to client-side tools like the SDK
I appreciate your input and understand the concerns you've raised. Our goal is to keep the SDK documentation focused on its core functionality while providing clear guidance. If there are specific gaps or ambiguities you've noticed in relation to the SDK itself, feel free to open a separate documentation issue with actionable feedback.
Thanks again, Ran~
@RanVaknin To be frank, I find it wholly unacceptable that AWS's stance here is to do absolutely nothing to help its customers. Unless you explicitly happen to test this race condition, and get the timing right, you won't notice until you already have an issue in production.
At the very least, the SDK MUST return an error if it detects that two chunks came from different versions during a multipart download.
Describe the bug
Expected Behavior
The latest version of the object is downloaded
Current Behavior
An error is encountered from time to time, e.g.,
or
With logging turned on, it's observed that when a later part is being downloaded and when the object is updated before that, the later chunk from a different version is downloaded and therefore corrupting the output.
Reproduction Steps
Minimally reproducible example:
Producer
The producer is simply a script that uploads a gzipped file (greater than 5MB) to a bucket constantly
Consumer
Possible Solution
When
GetObjectInput.versionId
is not provided by the user (which means getting the latest object version), send a request to first figure out the latest version of the object, and then set theversionId
in the subsequentdownloadChunk
method.Additional Information/Context
No response
SDK version used
1.44
Environment details (Version of Go (
go version
)? OS name and version, etc.)1.20