dandi / s3invsync

AWS S3 Inventory-based backup tool with efficient incremental & versionId support
MIT License

Handle special characters in `s3://` URLs #11

Open jwodder opened 6 hours ago

jwodder commented 6 hours ago

s3invsync takes an AWS-style `s3://{bucket}/{key}/` URL for the inventory base on the command line, and many of the error messages include S3 URLs as well. However, how should special characters like whitespace be handled in such URLs? Do they need to be percent-encoded?
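To make the ambiguity concrete, a minimal Python sketch of the two possible readings of the same command-line argument (`split_s3_url` is just a made-up helper for illustration, not s3invsync code):

from urllib.parse import unquote, urlparse

def split_s3_url(url: str) -> tuple[str, str]:
    # Split an s3://{bucket}/{key} URL into (bucket, key) without any decoding.
    parsed = urlparse(url)
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = split_s3_url("s3://example-bucket/inventory/My%20Report/")
print(key)           # 'inventory/My%20Report/' -- the key taken verbatim
print(unquote(key))  # 'inventory/My Report/'   -- the key if percent-decoding is applied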

@yarikoptic Please look into how the `aws` command handles S3 URLs whose keys contain special characters and report back.

jwodder commented 6 hours ago

For the record, I found the following sources on S3 URLs, but neither addresses this:

yarikoptic commented 5 hours ago

Do you mean how the path component (corresponding to `{key}/`) in an `s3://` URL should be treated when mapping to the `{key}` location on S3 and subsequently on the filesystem? My guess is that, since it is a URL, it should be `urllib.parse.unquote`-ed.
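For reference, a quick (made-up) example of what unquote-ing would do, including the hazard for keys that legitimately contain a percent sign:

from urllib.parse import unquote

print(unquote("data/file%20name.txt"))    # 'data/file name.txt'   -- '%20' becomes a space
print(unquote("reports/100%25done.txt"))  # 'reports/100%done.txt' -- a literal '%' would have to be given as '%25'
print(unquote("logs/a%2Fb.txt"))          # 'logs/a/b.txt'         -- '%2F' turns into '/', changing the key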

BTW, since the tool is to be used in a cron job, would it make sense to allow for a config file, like the one you made for https://github.com/con/tinuous? Then the input could just be a separate bucket_name and an optional bucket_prefix (since there are really no keys for folders on S3), avoiding S3 URLs altogether.

yarikoptic commented 5 hours ago

BTW, here is the URL I found earlier today on legitimate key syntax (I was checking whether newlines are allowed): https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
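My reading of that page, as a rough sketch (my interpretation, not an official validator): any UTF-8 string of up to 1,024 bytes is a legal key, and newlines do not appear to be forbidden, only covered by the "might require special handling" guidance:

def looks_like_valid_key(key: str) -> bool:
    # Per the linked doc, a key is a sequence of Unicode characters whose
    # UTF-8 encoding is at most 1,024 bytes; there is no hard character blacklist.
    return 0 < len(key.encode("utf-8")) <= 1024

print(looks_like_valid_key("a\nb"))      # True  -- newline is allowed
print(looks_like_valid_key("x" * 2000))  # False -- exceeds the 1,024-byte limit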

jwodder commented 5 hours ago

> Do you mean how the path component (corresponding to `{key}/`) in an `s3://` URL should be treated when mapping to the `{key}` location on S3 and subsequently on the filesystem? My guess is that, since it is a URL, it should be `urllib.parse.unquote`-ed.

Yes, that's what I mean, and it's a reasonable guess, but can you confirm that's how it works? Try creating a test file with a space in its name on S3 and see (a) how `aws s3 ls` displays it and (b) how `aws s3` requires you to specify the file's S3 URL as an argument.
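(If it's easier, roughly the same experiment via boto3 rather than the aws CLI; the bucket name below is just a placeholder.)

import boto3

s3 = boto3.client("s3")
bucket = "my-test-bucket"                  # placeholder -- use a bucket you can write to
key = "test dir/file with spaces.txt"

# Upload an object whose key contains spaces, then list it back with a raw
# (not percent-encoded) prefix.
s3.put_object(Bucket=bucket, Key=key, Body=b"hello")
resp = s3.list_objects_v2(Bucket=bucket, Prefix="test dir/")
for obj in resp.get("Contents", []):
    print(repr(obj["Key"]))                # 'test dir/file with spaces.txt'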

> BTW, since the tool is to be used in a cron job, would it make sense to allow for a config file, like the one you made for https://github.com/con/tinuous? Then the input could just be a separate bucket_name and an optional bucket_prefix (since there are really no keys for folders on S3), avoiding S3 URLs altogether.

That seems unnecessary.

yarikoptic commented 5 hours ago

Why do we care how the `aws` CLI works -- or are we going to interface with it?

(Note: there is no "s3" protocol as such, and AFAIK all implementations of handling `s3://` URLs are pretty much ad hoc, so I would not be surprised if they vary in their approaches.)

I used this script to produce a key with an even more obscure name and to upload the script itself to S3:

#!/bin/bash
#
# Create an S3 object whose key contains obscure characters,
# using this script itself as the object body.
#
aws s3api put-object --bucket dandiarchive-embargo \
    --key '!#$%&'"'"'()*+,-./:;<=>?@[\\]^_\`{|}~azAZ09␣ unicode测试\r' \
    --body "$0"

I used dandiarchive-embargo since it is no longer used, AFAIK. Ended up with:

dandi@drogon:~/bin$ aws s3 ls 's3://dandiarchive-embargo/!#$%&'"'"'()*+,-./' | head
2024-11-18 17:08:38        199 :;<=>?@[\\]^_\`{|}~azAZ09␣ unicode测试\r
dandi@drogon:~/bin$ aws s3 ls 's3://dandiarchive-embargo/!#$%&'"'"'()*+,-./:;<=>?@[\\]^_\`{|}~azAZ09␣ unicode测试\r' | head
2024-11-18 17:08:38        199 :;<=>?@[\\]^_\`{|}~azAZ09␣ unicode测试\r

So the `aws` CLI takes the characters as-is and does not URL-unquote anything; percent-quoted paths do not work:

In [1]: import urllib.parse

In [2]: urllib.parse.quote("!#$%&'()*+,-./")
Out[2]: '%21%23%24%25%26%27%28%29%2A%2B%2C-./'
dandi@drogon:~/bin$ aws s3 ls "s3://dandiarchive-embargo/%21%23%24%25%26%27%28%29%2A%2B%2C-./"
dandi@drogon:~/bin$ aws s3 ls "s3://dandiarchive-embargo/%21"
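
So neither percent-encoded form matches anything. Based on that, a minimal sketch of the behavior I would expect on the s3invsync side: split off the bucket and take the rest of the path verbatim as the key prefix, with no percent-decoding (illustration only, not the actual s3invsync code):

def parse_s3_url(url: str) -> tuple[str, str]:
    # Take the key prefix exactly as given: no unquoting, spaces and all.
    if not url.startswith("s3://"):
        raise ValueError(f"not an S3 URL: {url!r}")
    bucket, _, key_prefix = url[len("s3://"):].partition("/")
    if not bucket:
        raise ValueError(f"no bucket in S3 URL: {url!r}")
    return bucket, key_prefix

print(parse_s3_url("s3://dandiarchive-embargo/!#$%&'()*+,-./"))
# ('dandiarchive-embargo', "!#$%&'()*+,-./")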