Open xsuchy opened 11 years ago
Amazon Glacier does not permit anything non-ASCII in the name. Details are here: http://docs.amazonwebservices.com/amazonglacier/latest/dev/api-archive-post.html
"The description must be less than or equal to 1,024 characters. The allowable characters are 7-bit ASCII without control codes, specifically ASCII values 32—126 decimal or 0x20—0x7E hexadecimal."
The error message you get from glacier-cli is not helpful though, and I will leave this issue open to fix that.
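A client-side check of that constraint, with a more readable error, could look like the minimal sketch below. It is written for Python 3 and only mirrors the rule quoted above; the name `check_description` is hypothetical and is not part of glacier-cli or boto.

```python
MAX_DESCRIPTION_LENGTH = 1024  # limit quoted from the Glacier documentation


def check_description(description):
    """Raise ValueError with a readable message if the rule is violated.

    Hypothetical helper, not part of glacier-cli or boto.
    """
    if len(description) > MAX_DESCRIPTION_LENGTH:
        raise ValueError(
            "archive description is %d characters long; the limit is %d"
            % (len(description), MAX_DESCRIPTION_LENGTH)
        )
    for index, char in enumerate(description):
        # Allowed range is ASCII 32-126 (0x20-0x7E), i.e. no control codes.
        if not 32 <= ord(char) <= 126:
            raise ValueError(
                "archive description contains %r at position %d; "
                "only ASCII 32-126 is allowed" % (char, index)
            )
    return description
```

Calling `check_description(name)` before the upload would turn the unhelpful failure into a message that points at the offending character.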
Boto can handle it by passing in decoded UTF-8. So we just have to pass it to boto as unicode and not as ascii. I tested this patch:
diff --git a/glacier.py b/glacier.py
index 784736a..b18d072 100755
--- a/glacier.py
+++ b/glacier.py
@@ -395,6 +395,7 @@ class App(object):
     def archive_list(self, args):
         archive_list = list(self.cache.get_archive_list(args.vault))
         if archive_list:
+            # FIXME problem here
             print(*archive_list, sep="\n")
     def archive_upload(self, args):
@@ -412,6 +413,8 @@ class App(object):
                 raise RuntimeError('Archive name not specified. Use --name')
             name = os.path.basename(full_name)
+        if not isinstance(name, unicode):
+            name = name.decode('utf-8')
         vault = self.connection.get_vault(args.vault)
         archive_id = vault.create_archive_from_file(file_obj=args.file, description=name)
         self.cache.add_archive(args.vault, name, archive_id)
The second part makes uploading work, but archive list will then fail.
At the FIXME, the value of archive_list in my case is:
[u'./somefile', u'./2009/Ag\xe1tka ve \u0161kolce/IMG_3876.jpg']
The first item is printed, but when it iterates to the second item, it fails with this traceback:
Traceback (most recent call last):
  File "/usr/local/bin/glacier", line 621, in <module>
    App().main()
  File "/usr/local/bin/glacier", line 607, in main
    args.func(args)
  File "/usr/local/bin/glacier", line 399, in archive_list
    print(*archive_list, sep="\n")
This is weird to me, because when I try to reproduce it in the Python console, it works:
$ python
Python 2.7.3 (default, Aug 9 2012, 17:23:57)
[GCC 4.7.1 20120720 (Red Hat 4.7.1-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from __future__ import print_function
>>> a=[u'./2009/Ag\xe1tka ve \u0161kolce/IMG_3876.jpg']
>>> a
[u'./2009/Ag\xe1tka ve \u0161kolce/IMG_3876.jpg']
>>> print(*a, sep="\n")
./2009/Agátka ve školce/IMG_3876.jpg
>>> import iso8601
>>> print(*a, sep="\n")
./2009/Agátka ve školce/IMG_3876.jpg
>>> from __future__ import unicode_literals
>>> print(*a, sep="\n")
./2009/Agátka ve školce/IMG_3876.jpg
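One possible explanation, which this thread does not confirm, is that the script's standard output used an ASCII encoding (for example when piped or run under a C locale) while the interactive console used UTF-8. The Python 3 sketch below simulates both cases with in-memory streams; `print_names` is a hypothetical helper, not glacier-cli code.

```python
import io


def print_names(names, encoding):
    """Print names to an in-memory stream that uses the given encoding.

    Returns the bytes written, or raises UnicodeEncodeError when the
    encoding cannot represent one of the names.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    print(*names, sep="\n", file=stream)
    stream.flush()
    return raw.getvalue()


names = ["./somefile", "./2009/Ag\xe1tka ve \u0161kolce/IMG_3876.jpg"]

# A UTF-8 stream accepts both names.
utf8_output = print_names(names, "utf-8")

# An ASCII stream fails once it reaches the second name.
try:
    print_names(names, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

If this is the cause, the difference between the script and the console lies in the output stream, not in the list contents.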
I don't understand. If Amazon will only take ASCII in the range 32-126, how do you expect glacier-cli or boto to encode it for sending to Amazon? If, after a disaster, you use a different tool for recovery, how will that tool know how to decode your encoded archive names?
Let's take the character 'á' (http://www.fileformat.info/info/unicode/char/e1/index.htm). If you pass it as is, then boto passes it as U+00E1, which is outside the range. But if you pass it as u'á', then it will be encoded and passed as '\xc3\xa1' (an 8-character-long string), which is within range. This transformation is applied only to characters out of range; characters within 0-128 are left intact. This leaves a small corner case open, but I assume no one puts \n or \r in a file name :)
I don't follow. "\xc3" is 195 decimal, which is greater than the Amazon 126 limit, no?
I think I've just understood what you are trying to do and now get what you mean by "8 character long string".
The problem is though that this overloads the backslash character. If Amazon Glacier gives glacier-cli an archive of description '\xc3\xa1' (8 byte long literal), then how does glacier-cli know whether to create a filename of exactly 8 ASCII bytes ['\', 'c', '3', ...] or a filename of exactly 1 UTF-8 'á'?
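The ambiguity can be reproduced with Python's `unicode_escape` codec. That codec is an assumption about the kind of escaping under discussion, and the two file names are made up for illustration. The sketch is Python 3.

```python
# A file whose name is the single character 'á' ...
accented_name = "\xe1"
# ... and a different file whose name is the four ASCII characters \ x e 1.
literal_name = "\\xe1"

# Plain UTF-8 does not help: both bytes are above the 126 limit.
utf8_bytes = accented_name.encode("utf-8")

# A tool that escapes names before upload sends this description for 'á'.
escaped = accented_name.encode("unicode_escape").decode("ascii")

# A tool that uploads names unchanged sends this description for the
# second file. On the Glacier side the two descriptions are identical.
unescaped = literal_name

# Decoding on download cannot tell which of the two files was meant.
decoded = unescaped.encode("ascii").decode("unicode_escape")
```

The round trip is unambiguous only if every tool that touches the vault applies the same escaping to every name, which is the interoperability concern raised here.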
Fundamentally, glacier-cli is a front end for Amazon Glacier, and Glacier doesn't support Unicode so neither can glacier-cli without introducing ambiguities in decoding which harms interoperability with other tools. So I regret that glacier-cli will never be able to support Unicode archive names by default.
If you want to add functionality so that the user can specify some kind of mapping as a command line option (that won't be default), then I'd be happy to accept that. It would need to either be some accepted standard method or be done in a pluggable way to support multiple mappings, and needs to be free of conversion ambiguities.
Alternatively, a wrapper to glacier-cli might be able to do this, or users could use git-annex which keeps filename metadata in the annex instead of in the special remote.
I do not suppose anyone names files in UTF-8 encoded format, but OK. I think having this as an option that is off by default is fine as well. What about --allow-utf8?
I found the problem with archive list,
so I may be able to provide a patch soon.
But what encoding would --allow-utf8 use?
I'm just looking at http://docs.python.org/2/library/codecs.html#standard-encodings. Why don't we pick one of these? A suitable one would be an encoding that converts from Unicode to something that Amazon Glacier can accept (i.e. fits into the range 32-126). How about quopri-codec? Quoted-printable is a fairly standard way of embedding Unicode data into a 7-bit stream, right?
I'd prefer --convert-utf8 to make it clear that what goes into Glacier is being modified in some way.
So then glacier-cli could do a simple name_to_send = local_unicode_name.encode('quopri-codec')
on the way in, and local_unicode_name = name_received.decode('quopri-codec')
on the way back, if (and only if) --convert-utf8 was specified. I'm having some trouble with the details of this, but I hope you get the gist.
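A minimal Python 3 sketch of that round trip follows. In Python 3, quoted-printable is a bytes-to-bytes transform, so the name is first encoded as UTF-8 (an assumption; the thread does not fix the intermediate encoding). The function names are hypothetical. `quopri.encodestring` wraps long input with soft line breaks (`=` followed by a newline), which are stripped here because a newline falls outside the range Glacier accepts.

```python
import quopri


def name_to_description(local_name):
    """Encode a Unicode name as quoted-printable ASCII text."""
    encoded = quopri.encodestring(local_name.encode("utf-8"))
    # Remove the soft line breaks that the encoder inserts for long input.
    encoded = encoded.replace(b"=\n", b"")
    return encoded.decode("ascii")


def description_to_name(description):
    """Reverse name_to_description."""
    return quopri.decodestring(description.encode("ascii")).decode("utf-8")


original = "./2009/Ag\xe1tka ve \u0161kolce/IMG_3876.jpg"
description = name_to_description(original)
restored = description_to_name(description)
```

Names containing tabs or newlines are not handled by this sketch. A literal `=` in a name is escaped by the encoder as `=3D`, so it survives the round trip.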
How does this sound?
On 11/22/2012 02:14 PM, basak wrote:
> I'm just looking at http://docs.python.org/2/library/codecs.html#standard-encodings. Why don't we pick one of these?
But UTF8 is one of these - unicode_escape :)
> fits into the range 32-126). How about quopri-codec? Quoted-printable is a fairly standard way of embedding Unicode data into a 7-bit stream, right?
I do not agree. That is standard in the email world, but in my experience UTF is standard everywhere else.
> I'd prefer --convert-utf8 to make it clear that what goes into Glacier is being modified in some way.
OK, --convert-utf8 then.
Mirek
But UTF-8 is not unicode_escape! We cannot use UTF-8 since Amazon is not 8-bit clean for Glacier archive descriptions. And if we use unicode_escape, then we're limiting our interoperability only to other Python tools.
Is there a common encoding that is 7-bit friendly that is generally accepted and not Python-specific? Apart from quoted-printable, I only see base64 and hex.
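For comparison, the three candidates can be applied to the same name with the Python 3 standard library. This is only an illustrative sketch: the sample name is taken from the path used earlier in the thread, and UTF-8 as the intermediate byte form is an assumption.

```python
import base64
import binascii
import quopri

name = "Ag\xe1tka"
raw = name.encode("utf-8")  # UTF-8 as the intermediate form is an assumption

candidates = {
    "quoted-printable": quopri.encodestring(raw).decode("ascii"),
    "base64": base64.b64encode(raw).decode("ascii"),
    "hex": binascii.hexlify(raw).decode("ascii"),
}


def fits_glacier_range(text):
    """True if every character is within ASCII 32-126."""
    return all(32 <= ord(char) <= 126 for char in text)


for label, encoded in candidates.items():
    print(label, encoded, fits_glacier_range(encoded))
```

All three stay within the allowed range. Quoted-printable keeps the ASCII parts of a name readable, while base64 and hex obscure the whole name.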
On 23.11.2012 09:33, basak wrote:
> Is there a common encoding that is 7-bit friendly that is generally accepted and not Python-specific? Apart from quoted-printable, I only see base64 and hex.
Hmm, I think everybody will have a different opinion. So what about
--convert-name=CODE
where CODE is anything from
http://docs.python.org/2/library/codecs.html#standard-encodings
and the admin can decide which one to use.
I already have the patch ready, and checking it, I see that such a change is possible and will in fact be trivial. I will have to test it again though.
That sounds absolutely fine. How about --transcode-names=...
for the name of the option? That would be even more specific about what it actually does (now that we know!), and in some cases more than one name is being converted.
--transcode-names= I agree with you.
I will send a pull request on Monday.
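The patch itself is not shown in this thread. As a rough Python 3 sketch of how such an option might behave, the code below accepts a codec name, transcodes the archive name, and refuses results that fall outside ASCII 32-126. `build_parser`, `encode_name` and `decode_name` are hypothetical names, and only text encodings (str to bytes) are handled.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="glacier")
    parser.add_argument(
        "--transcode-names",
        metavar="CODEC",
        default=None,
        help="encode archive names with this Python codec before upload",
    )
    return parser


def encode_name(name, codec):
    """Return the description to send for name, or raise ValueError."""
    if codec is None:
        description = name
    else:
        # latin-1 maps each encoded byte to one character for the range check.
        description = name.encode(codec).decode("latin-1")
    if not all(32 <= ord(char) <= 126 for char in description):
        raise ValueError("%r does not fit ASCII 32-126 after encoding" % name)
    return description


def decode_name(description, codec):
    """Reverse encode_name for a description received from Glacier."""
    if codec is None:
        return description
    return description.encode("latin-1").decode(codec)


args = build_parser().parse_args(["--transcode-names", "unicode_escape"])
sent = encode_name("Ag\xe1tka", args.transcode_names)
received = decode_name(sent, args.transcode_names)
```

With this shape, a codec such as `utf-8` is rejected at upload time because its output is not 7-bit clean, while the default (no option given) leaves names untouched.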
This is just biting me. What is the status of this request?
I have a file named "./2009/Agátka ve školce/PC090374.JPG" and I'm trying to upload it using: `glacier archive upload --name "./2009/Agátka ve školce/PC090374.JPG" Photos "./2009/Agátka ve školce/PC090374.JPG"`
I end up with a traceback:
Not sure whether this is a problem in boto or in glacier-cli.
Will investigate later.