Could not upload files with diacritics in name

xsuchy commented 11 years ago

I have a file named "./2009/Agátka ve školce/PC090374.JPG" and I'm trying to upload it using: `glacier archive upload --name "./2009/Agátka ve školce/PC090374.JPG" Photos "./2009/Agátka ve školce/PC090374.JPG"`

I end up with this traceback:

Traceback (most recent call last):
  File "/usr/local/bin/glacier", line 618, in <module>
    App().main()
  File "/usr/local/bin/glacier", line 604, in main
    args.func(args)
  File "/usr/local/bin/glacier", line 416, in archive_upload
    archive_id = vault.create_archive_from_file(file_obj=args.file, description=name)
  File "/home/mirek/glacier-cli/boto/glacier/vault.py", line 163, in create_archive_from_file
    part_size=part_size)
  File "/home/mirek/glacier-cli/boto/glacier/vault.py", line 126, in create_archive_writer
    description)
  File "/home/mirek/glacier-cli/boto/glacier/layer1.py", line 479, in initiate_multipart_upload
    response_headers=response_headers)
  File "/home/mirek/glacier-cli/boto/glacier/layer1.py", line 83, in make_request
    raise UnexpectedHTTPResponseError(ok_responses, response)
boto.glacier.exceptions.UnexpectedHTTPResponseError

Not sure whether this is a problem in boto or in glacier-cli.

Will investigate later.

basak commented 11 years ago

Amazon Glacier does not permit anything non-ASCII in the name. Details are here: http://docs.amazonwebservices.com/amazonglacier/latest/dev/api-archive-post.html

"The description must be less than or equal to 1,024 characters. The allowable characters are 7-bit ASCII without control codes, specifically ASCII values 32—126 decimal or 0x20—0x7E hexadecimal."

The error message you get from glacier-cli is not helpful though, and I will leave this issue open to fix that.
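One way a clearer error could look is a check that runs before the upload request is made. The following is only a sketch in Python 3: `check_description` and `MAX_DESCRIPTION_LENGTH` are hypothetical names, not part of glacier-cli or boto, and the limits are the ones from the Amazon documentation quoted above.

```python
# Hypothetical helper, not part of glacier-cli or boto.
MAX_DESCRIPTION_LENGTH = 1024


def check_description(description):
    """Return the description unchanged, or raise ValueError with a readable message."""
    if len(description) > MAX_DESCRIPTION_LENGTH:
        raise ValueError(
            "archive description is %d characters long; Amazon Glacier "
            "accepts at most %d" % (len(description), MAX_DESCRIPTION_LENGTH)
        )
    for position, char in enumerate(description):
        # Glacier only accepts ASCII values 32-126 (0x20-0x7E).
        if not 32 <= ord(char) <= 126:
            raise ValueError(
                "archive description contains %r at position %d; Amazon "
                "Glacier only accepts ASCII characters 32-126" % (char, position)
            )
    return description
```

Called on the name before `create_archive_from_file`, a check like this would report the offending character instead of an `UnexpectedHTTPResponseError`.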

xsuchy commented 11 years ago

Boto can handle it if the name is passed in as decoded UTF-8. So we just have to pass it to boto as a unicode object and not as an ASCII byte string. I tested this patch:

diff --git a/glacier.py b/glacier.py
index 784736a..b18d072 100755
--- a/glacier.py
+++ b/glacier.py
@@ -395,6 +395,7 @@ class App(object):
     def archive_list(self, args):
         archive_list = list(self.cache.get_archive_list(args.vault))
         if archive_list:
+            # FIXME problem here
             print(*archive_list, sep="\n")

     def archive_upload(self, args):
@@ -412,6 +413,8 @@ class App(object):
                 raise RuntimeError('Archive name not specified. Use --name')
             name = os.path.basename(full_name)

+        if not isinstance(name, unicode):
+            name = name.decode('utf-8')
         vault = self.connection.get_vault(args.vault)
         archive_id = vault.create_archive_from_file(file_obj=args.file, description=name)
         self.cache.add_archive(args.vault, name, archive_id)

The second hunk makes uploading work, but archive list then fails. At the FIXME, the value of archive_list in my case is: [u'./somefile', u'./2009/Ag\xe1tka ve \u0161kolce/IMG_3876.jpg']. The first item is printed, but when it gets to the second item it fails with this traceback:

Traceback (most recent call last):
  File "/usr/local/bin/glacier", line 621, in <module>
    App().main()
  File "/usr/local/bin/glacier", line 607, in main
    args.func(args)
  File "/usr/local/bin/glacier", line 399, in archive_list
    print(*archive_list, sep="\n")

Which is weird to me, because when I try to reproduce it in the Python console it works:

$ python
Python 2.7.3 (default, Aug  9 2012, 17:23:57) 
[GCC 4.7.1 20120720 (Red Hat 4.7.1-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from __future__ import print_function
>>> a=[u'./2009/Ag\xe1tka ve \u0161kolce/IMG_3876.jpg']
>>> a
[u'./2009/Ag\xe1tka ve \u0161kolce/IMG_3876.jpg']
>>> print(*a, sep="\n")
./2009/Agátka ve školce/IMG_3876.jpg
>>> import iso8601
>>> print(*a, sep="\n")
./2009/Agátka ve školce/IMG_3876.jpg
>>> from __future__ import unicode_literals
>>> print(*a, sep="\n")
./2009/Agátka ve školce/IMG_3876.jpg
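
The thread does not say what the actual cause was. One common reason for this kind of difference (an assumption, not confirmed here) is that the interactive console writes to a UTF-8 terminal, while the failing run writes to a stream whose encoding is ASCII, for example when output is piped or the locale is C. A Python 3 sketch of that situation, with the stream encodings set explicitly; `print_names` is a hypothetical helper:

```python
import io

names = ["./somefile", "./2009/Ag\xe1tka ve \u0161kolce/IMG_3876.jpg"]


def print_names(names, stream):
    """Print one archive name per line to the given text stream."""
    print(*names, sep="\n", file=stream)
    stream.flush()


# A stream that can only encode ASCII fails once it reaches the second name.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    print_names(names, ascii_stream)
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The same call succeeds when the stream encoding is UTF-8.
utf8_buffer = io.BytesIO()
utf8_stream = io.TextIOWrapper(utf8_buffer, encoding="utf-8", newline="\n")
print_names(names, utf8_stream)
utf8_output = utf8_buffer.getvalue().decode("utf-8")
```

If this is the cause, encoding the names explicitly before writing them would make the output independent of the environment.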

basak commented 11 years ago

I don't understand. If Amazon will only take ASCII in the range 32-126, how do you expect glacier-cli or boto to encode it for sending to Amazon? If, after a disaster, you use a different tool for recovery, how will that tool know how to decode your encoded archive names?

xsuchy commented 11 years ago

Take the character 'á' (http://www.fileformat.info/info/unicode/char/e1/index.htm). If you pass it as is, boto passes it as U+00E1, which is outside the range. But if you pass it as u'á', it is encoded and passed as '\xc3\xa1' (8 character long string), which is within the range. This transformation is applied only to characters outside the range; characters within 0-127 are left intact. This leaves a small corner case open, but I assume no one puts \n or \r in a file name :)

basak commented 11 years ago

I don't follow. "\xc3" is 195 decimal, which is greater than the Amazon 126 limit, no?

basak commented 11 years ago

I think I've just understood what you are trying to do and now get what you mean by "8 character long string".

The problem is though that this overloads the backslash character. If Amazon Glacier gives glacier-cli an archive of description '\xc3\xa1' (8 byte long literal), then how does glacier-cli know whether to create a filename of exactly 8 ASCII bytes ['\', 'c', '3', ...] or a filename of exactly 1 UTF-8 'á'?
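
The collision can be shown in a few lines. This is a Python 3 sketch; `naive_escape` is a hypothetical function that models the scheme discussed here (escape only the bytes outside 32-126 and leave everything else, including the backslash, untouched). It is not code from boto or glacier-cli.

```python
def naive_escape(name):
    """Replace each UTF-8 byte outside 32-126 with a literal backslash-x-hex sequence."""
    pieces = []
    for byte in name.encode("utf-8"):
        if 32 <= byte <= 126:
            pieces.append(chr(byte))
        else:
            pieces.append("\\x%02x" % byte)
    return "".join(pieces)


one_character = "\u00e1"          # the single character 'á'
eight_characters = "\\xc3\\xa1"   # backslash, x, c, 3, backslash, x, a, 1

# Raw UTF-8 for 'á' is two bytes, both above Glacier's upper limit of 126.
utf8_bytes = list(one_character.encode("utf-8"))

# Two different local names produce the same Glacier description,
# so the description alone cannot be mapped back to one file name.
same_description = naive_escape(one_character) == naive_escape(eight_characters)
```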

Fundamentally, glacier-cli is a front end for Amazon Glacier, and Glacier doesn't support Unicode so neither can glacier-cli without introducing ambiguities in decoding which harms interoperability with other tools. So I regret that glacier-cli will never be able to support Unicode archive names by default.

If you want to add functionality so that the user can specify some kind of mapping as a command line option (that won't be default), then I'd be happy to accept that. It would need to either be some accepted standard method or be done in a pluggable way to support multiple mappings, and needs to be free of conversion ambiguities.

Alternatively, a wrapper to glacier-cli might be able to do this, or users could use git-annex which keeps filename metadata in the annex instead of in the special remote.

xsuchy commented 11 years ago

I don't suppose anyone gives files names that look like escaped UTF-8, but OK. Having this as an option that is off by default is fine as well. What about --allow-utf8?

I found the problem with archive list, so I may be able to provide a patch soon.

basak commented 11 years ago

But what encoding would --allow-utf8 use?

I'm just looking at http://docs.python.org/2/library/codecs.html#standard-encodings. Why don't we pick one of these? A suitable one would be a coding that converts from Unicode to something that Amazon Glacier can accept (ie. fits into the range 32-126). How about quopri-codec? Quoted-printable is a fairly standard way of embedding Unicode data into a 7-bit stream, right?

I'd prefer --convert-utf8 to make it clear that what goes into Glacier is being modified in some way.

So then glacier-cli could do a simple name_to_send = local_unicode_name.encode('quopri-codec') on the way in, and local_unicode_name = name_received.decode('quopri-codec') on the way back, if (and only if) --convert-utf8 was specified. I'm having some trouble with the details of this, but I hope you get the gist.

How does this sound?
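
The quoted-printable round trip proposed above could be sketched as follows. This is Python 3, where `str.encode('quopri-codec')` is not available, so the sketch uses the standard `quopri` module on UTF-8 bytes instead. `name_to_description` and `description_to_name` are hypothetical names, and removing the soft line breaks is an assumption of this sketch rather than something the thread settles.

```python
import quopri


def name_to_description(name):
    """Encode a local Unicode name as quoted-printable text."""
    encoded = quopri.encodestring(name.encode("utf-8"))
    # quopri wraps long output with soft line breaks ("=" followed by a
    # newline). A newline is outside 32-126, and a soft break carries no
    # data, so it is dropped here.
    return encoded.replace(b"=\n", b"").decode("ascii")


def description_to_name(description):
    """Decode a quoted-printable description back to the local Unicode name."""
    return quopri.decodestring(description.encode("ascii")).decode("utf-8")
```

A literal "=" in a name is itself encoded as "=3D", which is what keeps the decoding unambiguous for names written with this scheme.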

xsuchy commented 11 years ago

On 11/22/2012 02:14 PM, basak wrote:

> I'm just looking at http://docs.python.org/2/library/codecs.html#standard-encodings. Why don't we pick one of these?

But UTF8 is one of these - unicode_escape :)

> fits into the range 32-126). How about quopri-codec? Quoted-printable is a fairly standard way of embedding Unicode data into a 7-bit stream, right?

I do not agree. That is the standard in the email world, but in my experience UTF is the standard everywhere else.

> I'd prefer --convert-utf8 to make it clear that what goes into Glacier is being modified in some way.

OK, --convert-utf8 then.

Mirek

basak commented 11 years ago

But UTF-8 is not unicode_escape! We cannot use UTF-8 since Amazon is not 8-bit clean for Glacier archive descriptions. And if we use unicode_escape, then we're limiting our interoperability only to other Python tools.

Is there a common encoding that is 7-bit friendly that is generally accepted and not Python-specific? Apart from quoted-printable, I only see base64 and hex.
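
For comparison, the other two candidates can be tried on part of the file name from this issue. This is a Python 3 sketch using the standard `binascii` and `base64` modules; `in_glacier_range` is a hypothetical helper.

```python
import base64
import binascii

name = "Ag\u00e1tka ve \u0161kolce"
raw = name.encode("utf-8")  # 18 bytes: 14 ASCII characters plus 2 two-byte characters

# Hex uses only 0-9 and a-f and doubles the length.
as_hex = binascii.hexlify(raw).decode("ascii")

# Base64 uses A-Z, a-z, 0-9, "+", "/" and "=" and grows the length by about a third.
as_base64 = base64.b64encode(raw).decode("ascii")


def in_glacier_range(text):
    """True if every character is in the 32-126 range Glacier accepts."""
    return all(32 <= ord(char) <= 126 for char in text)


# Both encodings decode back to the original name.
hex_round_trip = binascii.unhexlify(as_hex).decode("utf-8")
base64_round_trip = base64.b64decode(as_base64).decode("utf-8")
```

Both fit the allowed range and are not Python-specific, but neither leaves the name readable in the Glacier inventory, and the size increase counts against the 1,024-character description limit.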

xsuchy commented 11 years ago

On 23.11.2012 09:33, basak wrote:

> Is there a common encoding that is 7-bit friendly that is generally accepted and not Python-specific? Apart from quoted-printable, I only see base64 and hex.

Hmm, I think everybody will have a different opinion. So what about

--convert-name=CODE, where CODE is anything from http://docs.python.org/2/library/codecs.html#standard-encodings, so the admin can decide which one to use.

I already have the patch ready, and looking at it I see that such a change is possible and in fact trivial. I will have to test it again though.

basak commented 11 years ago

That sounds absolutely fine. How about --transcode-names=... for the name of the option? That would be even more specific about what it actually does (now that we know!), and in some cases more than one name is being converted.

xsuchy commented 11 years ago

--transcode-names= I agree with you.

I will send a pull request on Monday.
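
The thread does not show the resulting patch. As a rough sketch of how the agreed option could work in Python 3: only the option name `--transcode-names` comes from the discussion; `build_parser`, `encode_name` and `decode_name` are hypothetical, and the range check is an added assumption because some standard codecs produce output Glacier would reject.

```python
import argparse
import codecs


def build_parser():
    parser = argparse.ArgumentParser(prog="glacier")
    # Off by default, so names are sent unmodified unless the user opts in.
    parser.add_argument("--transcode-names", metavar="CODEC", default=None)
    return parser


def encode_name(name, codec):
    """Convert a local name to a Glacier description using the chosen codec."""
    if codec is None:
        return name
    try:
        # Text codecs such as "unicode_escape" accept str directly.
        raw = codecs.encode(name, codec)
    except TypeError:
        # Bytes-to-bytes codecs such as "hex" need UTF-8 bytes as input.
        raw = codecs.encode(name.encode("utf-8"), codec)
    # Text-to-text codecs such as "rot13" are not handled by this sketch.
    if any(not 32 <= byte <= 126 for byte in raw):
        raise ValueError("codec %r produced characters outside 32-126" % codec)
    return raw.decode("ascii")


def decode_name(description, codec):
    """Convert a Glacier description back to the local name."""
    if codec is None:
        return description
    decoded = codecs.decode(description.encode("ascii"), codec)
    if isinstance(decoded, bytes):
        decoded = decoded.decode("utf-8")
    return decoded
```

For example, `build_parser().parse_args(["--transcode-names=hex"])` selects hex, and the same codec has to be given again when listing or retrieving archives so the names can be decoded.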

nomeata commented 10 years ago

This is just byting me. What is the status of this request?

basak / glacier-cli

Could not upload files with diacritics in name #16