boto / botocore

The low-level, core functionality of boto3 and the AWS CLI.
Apache License 2.0

boto disagrees with S3 documentation on unicode metadata #2552

Closed AMZN-hgoffin closed 2 years ago

AMZN-hgoffin commented 2 years ago

(edited based on later comments)

Attempting to put a non-ASCII string as a user metadata value on a S3 object fails due to an explicit prohibition in the boto3 source code: https://github.com/boto/botocore/blob/04d1fae43b657952e49b21d16daa86378ddb4253/botocore/handlers.py#L543

Example:

>obj.put(Body="hello", Metadata={"meta": "™"})
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\u2122' in position 0: ordinal not in range(128)

The documentation cited in the validation function linked above has changed since the code was written, and now states "Amazon S3 allows arbitrary Unicode characters in your metadata values" (https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html)

The mismatch between what boto does and what S3 claims to support was surprising and wasted some time.

If I monkey-patch this check out, everything works fine and the PUT succeeds, as the documentation suggests it should. When fetching the object, the metadata is UTF-8 encoded and then base64 encoded for ASCII transmission via http headers, again totally as documented.

>import botocore
>botocore.handlers.BUILTIN_HANDLERS = [
    elem for elem in botocore.handlers.BUILTIN_HANDLERS
    if not (elem[0].startswith('before-parameter-build.s3.')
            and elem[1] == botocore.handlers.validate_ascii_metadata)]

>session = boto3.Session()
>s3 = session.resource('s3')
>obj = s3.Object('[bucket omitted]', 'testupload.txt')
>obj.put(Body="hello", Metadata={"meta": "™"})
{'ResponseMetadata': ...}
>obj.get()
{ ... 'x-amz-meta-meta': '=?UTF-8?B?w6LChMKi?=' ... }

In order to have boto support the documented behavior (which is, admittedly, not great behavior - double-encoding all of the strings) we could revert https://github.com/boto/botocore/pull/861 or else make it check keys only (instead of values).
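A keys-only check could look something like the following. This is a hypothetical sketch, not botocore's actual code; the function name and error text are illustrative, modeled on the existing validate_ascii_metadata handler and its event-handler signature:

```python
def validate_ascii_metadata_keys(params, **kwargs):
    """Reject non-ASCII metadata *keys* while allowing Unicode *values*.

    Hypothetical replacement for botocore's validate_ascii_metadata;
    the (params, **kwargs) signature mirrors a botocore event handler.
    """
    metadata = params.get('Metadata')
    if not metadata or not isinstance(metadata, dict):
        return
    for key in metadata:
        try:
            key.encode('ascii')
        except UnicodeEncodeError:
            raise ValueError(
                'Non ascii characters found in S3 metadata key "%s". '
                'S3 metadata keys can only contain ASCII characters.' % key)
```

Values would then pass through untouched and rely on S3's documented server-side handling, for better or worse.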

Alternatively, the Unicode error message could be updated to say that S3's support for Unicode values over REST protocol is incomplete, and therefore boto intentionally does not support setting non-ASCII values because it is not fully round-trip safe.

AMZN-hgoffin commented 2 years ago

Oh, it's even worse than I thought. The double-encoding behavior and other weirdness is fully internal to S3. The example result strings in the documentation are double-encoded, and it's documented to do this.

With the check removed, boto correctly signs and sends the data to S3 using UTF-8 encoding. However, the S3 backend then internally re-encodes the UTF-8 bytes using UTF-8 again.
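The header value returned by the earlier GET is consistent with this. A small sketch reproduces it; the Latin-1 reinterpretation step is my inference from the observed bytes, not documented S3 behavior:

```python
import base64

raw = "™".encode("utf-8")                       # b'\xe2\x84\xa2' (3 bytes)
# Reinterpret those bytes as Latin-1 text and UTF-8 encode a second time:
double = raw.decode("latin-1").encode("utf-8")  # b'\xc3\xa2\xc2\x84\xc2\xa2' (6 bytes)
# Wrap in an RFC 2047 encoded-word, as S3 returns it:
header = "=?UTF-8?B?%s?=" % base64.b64encode(double).decode("ascii")
print(header)  # =?UTF-8?B?w6LChMKi?=
```

That output matches the 'x-amz-meta-meta' value S3 sent back, which is what suggests the backend is UTF-8-encoding the already-UTF-8 bytes.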

This is evident from testing PUTs with metadata of increasing length until the server rejects the request. The longest possible all-ASCII metadata in a PUT is six bytes longer than the longest possible metadata that is ASCII except for a single trademark ™ character. The trademark character should only require 3 UTF-8 bytes, but here we are.

Given how broken this is, perhaps the right thing to do is simply to change the comment string and the user-facing exception text and point out that the behavior of non-ASCII metadata is nonsensical and it is intentionally not supported by boto3.

tim-finnigan commented 2 years ago

Hi @AMZN-hgoffin, thanks for reaching out. I think issue https://github.com/boto/boto3/issues/478 and specifically this comment https://github.com/boto/boto3/issues/478#issuecomment-180608544 help give more context here.

The error message I get when attempting your code snippet is:

botocore.exceptions.ParamValidationError: Parameter validation failed:
Non ascii characters found in S3 metadata for key "meta", value: "™".
S3 metadata can only contain ASCII characters.

Which seems to make sense, but to your point on this quote from the S3 documentation:

Amazon S3 allows arbitrary Unicode characters in your metadata values

I think we need more clarity on this from the S3 team. I opened a ticket and think we should use that to track this topic.

tim-finnigan commented 2 years ago

P54643421

tim-finnigan commented 2 years ago

Hi @AMZN-hgoffin - I saw you followed up on the ticket so I think we can keep the conversation focused there with the S3 team. I’ll go ahead and close this issue.

github-actions[bot] commented 2 years ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

tim-finnigan commented 2 years ago

Hi @AMZN-hgoffin, I’m reopening this issue for discussion. Someone just asked about this topic in the aws-cli repository. I know this is still being discussed internally, but I want to make sure the CLI/Python team is kept in the loop and tracking this here, especially if any code or documentation changes are required on our end.

buptwxd2 commented 2 years ago

Thanks @tim-finnigan for adding me to this loop. Please keep me posted on any updates or changes.

tim-finnigan commented 2 years ago

I just heard back from the S3 docs team and the documentation below was added here:

When using non US-ASCII characters in your metadata values, the provided unicode string is examined for non US-ASCII characters. Values of such headers are character decoded as per RFC 2047 before storing and encoded as per RFC 2047 to make them mail-safe before returning. If the string contains only US-ASCII characters, it is presented as is.
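Given that wording, a client that wants to round-trip such a value apparently has to undo both layers. A sketch using only the standard library; the extra Latin-1 step reflects the double encoding observed earlier in this thread and is an assumption on my part, not documented S3 behavior:

```python
from email.header import decode_header

value = "=?UTF-8?B?w6LChMKi?="        # header value S3 returned for "™"
# Undo the RFC 2047 encoded-word wrapper:
(payload, charset), = decode_header(value)
text = payload.decode(charset)         # still double-encoded at this point
# Undo the apparent extra UTF-8-over-Latin-1 round trip:
original = text.encode("latin-1").decode("utf-8")
print(original)                        # ™
```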


buptwxd2 commented 2 years ago

Thanks @tim-finnigan. When will botocore support this behavior? I tested the AWS CLI with version aws-cli/2.7.3 and it seems it still fails.

root@xxx ~# aws --version
aws-cli/2.7.3 Python/3.9.11 Linux/3.10.0-1127.el7.x86_64 exe/x86_64.centos.7 prompt/off
root@xxx ~# aws s3api put-object --bucket xd-bk-6 --key abc --meta a=中 --profile aws

Parameter validation failed: Non ascii characters found in S3 metadata for key "a", value: "中". S3 metadata can only contain ASCII characters.

tim-finnigan commented 2 years ago

Hi @buptwxd2 I'm going to reopen your original issue and continue the discussion there, as this issue pertains more to the documentation update.

buptwxd2 commented 2 years ago

Hi @tim-finnigan, sure. Thanks.