Closed · AMZN-hgoffin closed this issue 2 years ago
Oh, it's even worse than I thought. The double-encoding behavior and other weirdness is fully internal to S3. The example result strings in the documentation are double-encoded, and it's documented to do this.
With the check removed, boto correctly signs and sends the data to S3 using UTF-8 encoding. However, the S3 backend then internally re-encodes the UTF-8 bytes using UTF-8 again.
This is evident from testing PUT with metadata of increasing length until the server rejects the PUT. The longest possible all-ASCII metadata value in a PUT is six bytes longer than the longest possible value that is ASCII except for a single trademark ™ character. The ™ character should only require 3 UTF-8 bytes, but here we are.
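The observed byte counts are consistent with the backend re-interpreting the incoming UTF-8 bytes one at a time and UTF-8-encoding them again. This is a hypothesis about S3 internals, not confirmed behavior, but the arithmetic is easy to check locally:

```python
# Hypothesis: S3 treats each incoming UTF-8 byte as a separate
# character (Latin-1 style) and UTF-8-encodes it a second time.
# For "™" that turns 3 bytes into 6.
s = "\u2122"  # TRADE MARK SIGN (™)

once = s.encode("utf-8")
print(len(once))  # 3 bytes: e2 84 a2

# Re-interpret each byte as a Latin-1 character, then encode again.
twice = once.decode("latin-1").encode("utf-8")
print(len(twice))  # 6 bytes
```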
Given how broken this is, perhaps the right thing to do is simply to change the comment string and the user-facing exception text and point out that the behavior of non-ASCII metadata is nonsensical and it is intentionally not supported by boto3.
Hi @AMZN-hgoffin, thanks for reaching out. I think issue https://github.com/boto/boto3/issues/478 and specifically this comment https://github.com/boto/boto3/issues/478#issuecomment-180608544 help give more context here.
The error message I get when attempting your code snippet is:
botocore.exceptions.ParamValidationError: Parameter validation failed:
Non ascii characters found in S3 metadata for key "meta", value: "™".
S3 metadata can only contain ASCII characters.
Which seems to make sense, but to your point on this quote from the S3 documentation:
Amazon S3 allows arbitrary Unicode characters in your metadata values
I think we need more clarity on this from the S3 team. I opened a ticket and think we should use that to track this topic.
P54643421
Hi @AMZN-hgoffin - I saw you followed up on the ticket so I think we can keep the conversation focused there with the S3 team. I’ll go ahead and close this issue.
Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.
Hi @AMZN-hgoffin I’m reopening this issue for discussion. Someone just asked about this topic in the aws-cli repository. I know this is still being discussed internally, but I want to make sure the CLI/Python team is kept in the loop and tracking this here, especially if any code or documentation changes are required on our end.
Thanks @tim-finnigan for adding me to this loop. Please keep me posted on any updates or changes.
I just heard back from the S3 docs team and the documentation below was added here:
When using non US-ASCII characters in your metadata values, the provided unicode string is examined for non US-ASCII characters. Values of such headers are character decoded as per RFC 2047 before storing and encoded as per RFC 2047 to make them mail-safe before returning. If the string contains only US-ASCII characters, it is presented as is.
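The RFC 2047 round trip the docs describe can be sketched with Python's standard library email.header module. This illustrates the encoding scheme only; the exact charset and encoding S3 applies are assumptions here:

```python
# RFC 2047 "mail-safe" encoded-word round trip, using the stdlib.
# Mirrors what the S3 docs describe; whether S3 uses exactly this
# charset/encoding is an assumption.
from email.header import Header, decode_header

value = "caf\u00e9"  # contains a non-US-ASCII character

encoded = Header(value, charset="utf-8").encode()
print(encoded)  # an RFC 2047 encoded-word: =?utf-8?b?...?=

raw, charset = decode_header(encoded)[0]
print(raw.decode(charset))  # original string recovered
```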
Thanks @tim-finnigan. When will botocore support this behavior? I tested the AWS CLI at version aws-cli/2.7.3 and it seems it still fails.
root@xxx ~# aws --version
aws-cli/2.7.3 Python/3.9.11 Linux/3.10.0-1127.el7.x86_64 exe/x86_64.centos.7 prompt/off
root@xxx ~# aws s3api put-object --bucket xd-bk-6 --key abc --meta a=中 --profile aws
Parameter validation failed: Non ascii characters found in S3 metadata for key "a", value: "中". S3 metadata can only contain ASCII characters.
Hi @buptwxd2 I'm going to reopen your original issue and continue the discussion there, as this issue pertains more to the documentation update.
Hi @tim-finnigan, sure. Thanks
(edited based on later comments)
Attempting to put a non-ASCII string as a user metadata value on a S3 object fails due to an explicit prohibition in the boto3 source code: https://github.com/boto/botocore/blob/04d1fae43b657952e49b21d16daa86378ddb4253/botocore/handlers.py#L543
Example:
The documentation cited in the validation function linked above has changed since the code was written, and now states "Amazon S3 allows arbitrary Unicode characters in your metadata values" (https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html)
The mismatch between what boto does and what S3 claims to support was surprising and wasted some time.
If I monkey-patch this check out, everything works fine and the PUT succeeds, as the documentation suggests it should. When fetching the object, the metadata is UTF-8 encoded and then base64 encoded for ASCII transmission via http headers, again totally as documented.
In order to have boto support the documented behavior (which is, admittedly, not great behavior, since it double-encodes all of the strings) we could revert https://github.com/boto/botocore/pull/861 or else make the check apply to keys only (instead of values).
Alternatively, the non-ASCII error message could be updated to say that S3's support for Unicode values over the REST protocol is incomplete, and that boto therefore intentionally does not support setting non-ASCII values because the round trip is not fully safe.
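The "keys only" variant mentioned above could look roughly like this. It is a simplified sketch modeled on botocore's validate_ascii_metadata handler, using a plain ValueError in place of botocore's ParamValidationError so it stands alone:

```python
# Simplified "keys only" ASCII check for S3 metadata, modeled
# loosely on botocore's validate_ascii_metadata handler.
# Uses ValueError instead of botocore's ParamValidationError.
def validate_ascii_metadata_keys(params):
    metadata = params.get("Metadata")
    if not isinstance(metadata, dict):
        return
    for key in metadata:
        try:
            key.encode("ascii")
        except UnicodeEncodeError:
            raise ValueError(
                f"Non-ASCII characters found in S3 metadata key {key!r}; "
                "metadata keys become HTTP header names and must be ASCII."
            )

# Non-ASCII *values* now pass; only non-ASCII *keys* are rejected.
validate_ascii_metadata_keys({"Metadata": {"meta": "\u2122"}})  # ok
```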