pengisgood opened this issue 9 years ago
This behavior is known. If I were to guess, the files whose metadata is being lost are the ones that are being multipart copied. If a file is multipart copied, metadata is not automatically copied over like it is for a non-multipart copy. The problem is that transferring all of the information over exactly for a multipart copy would require roughly 4 to 5 more calls on the object (such as HeadObject and GetObjectACL calls) to gather everything needed for an exact copy. Do you happen to know the size of the objects that are missing the original metadata?
If that is the case, there is currently one workaround, which is to avoid using multipart copies. Take a look at this pull request: https://github.com/aws/aws-cli/pull/1122. Using the config file you can set the threshold at which multipart copies kick in. So if you set the threshold higher than the maximum size of your files, but less than 5GB, you should be better off.
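For reference, a minimal sketch of that workaround; the profile name and the 5GB value are only placeholders, and the same setting can also be written by hand into the s3 section of ~/.aws/config:

```bash
# Illustrative only: raise the size at which the CLI switches to multipart
# copies/uploads. Anything below this threshold is copied with a single
# CopyObject request, which does preserve metadata (CopyObject caps out at 5GB).
aws configure set s3.multipart_threshold 5GB --profile myprofile
```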
Not to be sour or ungrateful but I just sank a few hours into figuring this out, combined with #319.
I had issues with the file metadata randomly appearing and not appearing with s3 sync, in no discernible manner (initially I thought that S3 had some sort of unintentional "memory" of the metadata of the files being deleted and recreated), and then I finally found #1145. The workaround of aws configure set s3.multipart_threshold $MAX_SIZE --profile $PROFILENAME worked.
This is the kind of known issue that should be in screaming red letters all over the s3 sync documentation. I just blew a few hours on it, which in the grand scheme of things is not that bad; it could have been much worse had I synced thousands of production data files over thinking it worked fine and then deleted the originals. Silent data corruption errors are not cool, and I realize fixing something might not be simple on your end, but in the interim please let your users know.
@kyleknap - you were right, it was the multipart thing.
Another side effect of aws s3 sync using multipart uploads is that ObjectCreatedByPut events are no longer sent to AWS Lambda, thus Lambda functions relying on this trigger won't work for files bigger than 8MB.
@makmanalp's workaround seems to get around this issue too:
aws configure set s3.multipart_threshold 128MB
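If you want to confirm whether a given object actually went through a multipart upload or copy (and therefore may have lost metadata or fired a different event type), one quick check, with placeholder bucket and key names, is the ETag:

```bash
# Illustrative: an ETag like "d41d8cd98f00b204e9800998ecf8427e-12" indicates a
# 12-part multipart upload/copy; a plain MD5-looking ETag indicates a single PUT or copy.
aws s3api head-object --bucket my-bucket --key path/to/object --query ETag
```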
@makmanalp +1000, this is a fucking insane, massive bug. It'd be one thing if sync just dropped the content-encoding altogether, but the way things are currently, one might think "huh, I should double-check that sync actually preserves content encoding" *try it on a few files* "okay, looks good, let's do it on everything".
Not to minimize the insane, massive bug, but this originates as a limitation of S3. https://github.com/aws/aws-sdk-java/issues/367
Unfortunately S3 does not support the x-amz-metadata-directive header on InitiateMultipartUpload or UploadPartCopy requests. I've raised this with the service team and will come back on this issue when I hear back from them.
@gribbet I contacted the S3 service team and they are aware of the inconsistency; it's possible that they'll fix it in a future version of the service. However, given there is a workaround, there are higher priority issues to resolve.
"Given there is a workaround" is perhaps generous in the case of aws s3 sync/cp
, and it could be argued that given S3's current inabilities, the CLI should choose a different default. Or it could implement the partial workaround itself.
This limitation appears in the CLI documentation, though it is somewhat buried considering the severity of the issue.
> it could be argued that, given S3's current limitations, the CLI should choose a different default.
Yeah, definitely, it seems much saner to default to never using multipart copies for objects with a content-encoding or other metadata that would be dropped by it. (Or even to always drop the metadata, maybe with a flag to preserve it.)
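Until then, a rough sketch of doing that partial workaround by hand: re-apply the lost headers with an in-place copy using the REPLACE metadata directive. The bucket, key, metadata key/value, and encoding below are placeholders, and this only works for objects up to 5GB:

```bash
# Illustrative only: copy the object onto itself, replacing its metadata, to
# restore what a multipart sync/copy dropped.
aws s3api copy-object \
    --bucket my-dst-bucket --key path/to/file.json.gz \
    --copy-source my-dst-bucket/path/to/file.json.gz \
    --metadata-directive REPLACE \
    --metadata json=myvalue \
    --content-encoding gzip
```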
Good Morning!
We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.
This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.
As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.
We’ve imported existing feature requests from GitHub - Search for this issue there!
And don't worry, this issue will still exist on GitHub for posterity's sake. As it’s a text-only import of the original post into UserVoice, we’ll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.
GitHub will remain the channel for reporting bugs.
Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface
-The AWS SDKs & Tools Team
This entry can specifically be found on UserVoice at: https://aws.uservoice.com/forums/598381-aws-command-line-interface/suggestions/33168427-does-aws-s3-sync-will-loss-metadata-sometimes
Based on community feedback, we have decided to return feature requests to GitHub issues.
Hi, we are uploading very large files (>20GB), so we must use multipart upload. Is there a plan for when this inconsistency will be fixed? It is an old issue, but for some reason it is being ignored.
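For objects that are too big for a single CopyObject (over 5GB) and therefore have to go through a multipart copy, one possible manual approach, sketched below with placeholder bucket/key names, byte ranges, and metadata, is to drive the copy yourself through s3api: create-multipart-upload accepts the metadata and Content-Encoding up front, so nothing is lost:

```bash
# Illustrative sketch only. Start the multipart upload with the desired
# metadata and content encoding attached.
upload_id=$(aws s3api create-multipart-upload \
    --bucket dst-bucket --key big/file.bin \
    --metadata json=myvalue --content-encoding gzip \
    --query UploadId --output text)

# Copy the first 5GB of the source object as part 1; repeat with further byte
# ranges and part numbers for parts 2..N.
aws s3api upload-part-copy \
    --bucket dst-bucket --key big/file.bin \
    --copy-source src-bucket/big/file.bin \
    --copy-source-range bytes=0-5368709119 \
    --part-number 1 --upload-id "$upload_id"

# parts.json lists each part's PartNumber and the ETag returned by upload-part-copy.
aws s3api complete-multipart-upload \
    --bucket dst-bucket --key big/file.bin \
    --upload-id "$upload_id" \
    --multipart-upload file://parts.json
```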
😱
Scenario 1: some files' metadata with a key like x-amz-meta-json is lost.
Scenario 2:
sync files between two buckets that belong to two different accounts in the same region (for example, one account is for development and the other is for production); some files' metadata is also lost, with a key like x-amz-meta-json.
Note: the batch of files that lost metadata is the same in the two scenarios above.
Does anyone have the same issue?
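One quick way to spot-check whether a sync dropped metadata, with dev-bucket/prod-bucket and the key as placeholders, is to compare the user metadata (and the ETag, which ends in "-&lt;parts&gt;" for multipart-copied objects) on the source and destination:

```bash
# Illustrative: a multipart-copied destination object will typically show an
# empty Metadata map even though the source still has x-amz-meta-* values.
aws s3api head-object --bucket dev-bucket  --key path/to/file --query '{Metadata: Metadata, ETag: ETag}'
aws s3api head-object --bucket prod-bucket --key path/to/file --query '{Metadata: Metadata, ETag: ETag}'
```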