basak / glacier-cli

Command-line interface to Amazon Glacier

UnicodeDecodeError due to non-ASCII chars in key #72

Open jakubgs opened 6 years ago

jakubgs commented 6 years ago

I've run into glacier-cli failing because git-annex, when using the SHA256E backend, appends anything that looks like a file extension to the key. As a result, some keys end up with trailing characters that are not really part of the file's extension.

Example:

 % ls 12.\ Change\ The\ World\ \(feat.\ 웅산\).mp3 
12. Change The World (feat. 웅산).mp3
 % git annex info 12.\ Change\ The\ World\ \(feat.\ 웅산\).mp3
file: 12. Change The World (feat. 웅산).mp3
size: 7.48 megabytes
key: SHA256E-s7479642--957208748ae03fe4fc8d7877b2c9d82b7f31be0726e4a3dec9063b84cc64cf09.웅산.mp3
present: true
 % git annex calckey 12.\ Change\ The\ World\ \(feat.\ 웅산\).mp3
SHA256E-s7479642--957208748ae03fe4fc8d7877b2c9d82b7f31be0726e4a3dec9063b84cc64cf09.웅산.mp3

I've opened an issue with git-annex here: https://git-annex.branchable.com/bugs/git-annex_adds_unicode_characters_at_end_of_checksum/

There will be a fix for the case with brackets, but there are other cases in which a file extension might not be just ASCII. When that happens, this is the result:

% git annex copy 12.\ Change\ The\ World\ \(feat.\ 웅산\).mp3 --to glacier
copy 12. Change The World (feat. 웅산).mp3 (checking glacier...) Traceback (most recent call last):
  File "/usr/local/bin/glacier", line 737, in <module>
    main() 
  File "/usr/local/bin/glacier", line 733, in main
    App().main()
  File "/usr/local/bin/glacier", line 719, in main
    self.args.func()
  File "/usr/local/bin/glacier", line 600, in archive_checkpresent
    self.args.vault, self.args.name)
  File "/usr/local/bin/glacier", line 161, in get_archive_last_seen
    result = self._get_archive_query_by_ref(vault, ref).one()
  File "/usr/local/bin/glacier", line 136, in _get_archive_query_by_ref
    if ref.startswith('id:'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 83: ordinal not in range(128)
(user error (glacier ["--region=eu-west-1","archive","checkpresent","music","--quiet","SHA256E-s7479642--957208748ae03fe4fc8d7877b2c9d82b7f31be0726e4a3dec9063b84cc64cf09.\50885\49328.mp3"] exited 1)) failed
git-annex: copy: 1 failed

Now, as the bug report says, you can avoid this issue by switching your backend from SHA256E to SHA256, which does not add extensions to keys. But I think addressing this issue in glacier-cli would be good anyway.
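
For reference, a minimal sketch of where the failure comes from. The traceback is consistent with Python 2 implicitly decoding a byte string as ASCII when it is compared against a unicode string; that reading is an assumption, since the glacier-cli source is not shown here. The snippet below is Python 3 and only reproduces the explicit ASCII decode on the key from the example above. The helper name `first_non_ascii` is hypothetical.

```python
# The key git-annex passed to glacier-cli, as raw UTF-8 bytes.
key = (
    "SHA256E-s7479642--"
    "957208748ae03fe4fc8d7877b2c9d82b7f31be0726e4a3dec9063b84cc64cf09"
    ".웅산.mp3"
).encode("utf-8")


def first_non_ascii(data: bytes):
    """Return (position, byte) of the first byte ASCII cannot decode, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start]
    return None


position, byte = first_non_ascii(key)
print(position, hex(byte))  # prints: 83 0xec
```

The position and byte match the error message in the traceback: byte 0xec is the first UTF-8 byte of 웅, which sits right after the 83 ASCII bytes of the key prefix, hash, and dot.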

joeyh commented 6 years ago

Note that on unix, filenames have no defined encoding. No matter how the locale is set up, any filename can contain most any series of bytes. It would be good to just treat the filename passed to glacier as a binary blob if you can.
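
A rough sketch of what "treat it as a binary blob" could look like in Python 3, assuming the name has to pass through a `str` somewhere along the way. The `surrogateescape` error handler maps undecodable bytes to lone surrogates so that the original bytes can be recovered exactly. The function names are hypothetical and not part of glacier-cli.

```python
def name_from_bytes(raw: bytes) -> str:
    """Decode a filename without ever failing, keeping bad bytes as surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def name_to_bytes(name: str) -> bytes:
    """Reverse name_from_bytes, restoring the original byte sequence."""
    return name.encode("utf-8", errors="surrogateescape")


# A Latin-1 byte (invalid as UTF-8 here) followed by valid UTF-8 for one character.
raw = b"caf\xe9-\xec\x9b\x85.mp3"
assert name_to_bytes(name_from_bytes(raw)) == raw
```

This only keeps the bytes intact inside the program; it does not make the name acceptable to a service that restricts the characters it stores.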

basak commented 6 years ago

On Tue, Mar 06, 2018 at 05:49:41PM +0000, Joey Hess wrote:

> Note that on unix, filenames have no defined encoding. No matter how the locale is set up, any filename can contain most any series of bytes. It would be good to just treat the filename passed to glacier as a binary blob if you can.

IIRC, AWS Glacier limits "descriptions" to 7-bit printable ASCII, and glacier-cli uses the description as the "name" by default, so that no state needs to be carried outside Glacier for an archive to be fully restorable.

See https://github.com/basak/glacier-cli/issues/16 for another option in resolving this - by asking glacier-cli to use a lossless "encoding".

As far as I understand, there are only two options:

  1. Limit what names glacier-cli is given to 7-bit printable ASCII.
  2. Have glacier-cli encode the names it is given.
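
Both options can be sketched in a few lines of Python 3. This is only an illustration under assumptions: "7-bit printable ASCII" is taken to mean bytes 0x20 through 0x7E, percent-encoding is just one possible lossless encoding, and the function names are hypothetical rather than anything glacier-cli provides.

```python
from urllib.parse import quote, unquote_to_bytes


def is_printable_ascii(data: bytes) -> bool:
    """Option 1: check whether a name could be accepted unchanged."""
    return all(0x20 <= b <= 0x7E for b in data)


def encode_name(raw: bytes) -> str:
    """Option 2: percent-encode everything outside letters, digits and _.-~"""
    return quote(raw, safe="")


def decode_name(encoded: str) -> bytes:
    """Recover the exact original bytes from an encoded name."""
    return unquote_to_bytes(encoded)


key = (
    "SHA256E-s7479642--"
    "957208748ae03fe4fc8d7877b2c9d82b7f31be0726e4a3dec9063b84cc64cf09"
    ".웅산.mp3"
).encode("utf-8")

encoded = encode_name(key)
print(is_printable_ascii(key))                      # False
print(is_printable_ascii(encoded.encode("ascii")))  # True
print(decode_name(encoded) == key)                  # True
```

One thing to keep in mind with this kind of scheme: names that are already ASCII but contain characters such as a space or `%` would also change when encoded, so archives uploaded under the old unencoded names would need separate handling.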