ahankinson / pybagit

Python library for manipulating bagit files.
http://ahankinson.github.io/pybagit

UTF8 in file names not supported #4

Closed RoelWKramer closed 8 years ago

RoelWKramer commented 9 years ago

When I try to bag a set of files with Unicode characters in their names, pybagit fails.

I expected pybagit to include those filenames in the resulting bag.

Additionally, I would expect pybagit to crash on this error, but the stack trace is printed without crashing the app. I think this is because pybagit.py calls multichecksum.py with subprocess.Popen.
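To illustrate the behavior described above: an uncaught exception in a child process started with subprocess.Popen only reaches the parent as a non-zero return code plus whatever the child wrote to stderr. The following is a minimal sketch, not pybagit's actual code; `CHILD_CODE` and `run_child` are hypothetical names, and the one-line child merely stands in for multichecksum.py.

```python
import subprocess
import sys

# Hypothetical stand-in for multichecksum.py: the child raises an uncaught
# exception, so its interpreter prints a traceback to stderr and exits non-zero.
CHILD_CODE = "raise ValueError('simulated checksum failure')"


def run_child():
    """Run the child and return (returncode, captured stderr text)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        # Captured here so it can be inspected; if stderr is inherited instead,
        # the traceback shows up on the parent's terminal.
        stderr=subprocess.PIPE,
    )
    _, err = proc.communicate()
    return proc.returncode, err.decode("utf-8", "replace")


returncode, child_stderr = run_child()

# The parent keeps running unless it checks the return code itself.
if returncode != 0:
    print("child failed with exit code", returncode)
```

A caller that wants the failure to propagate has to check `returncode` (or use a helper such as `subprocess.check_call`, which raises `CalledProcessError` on a non-zero exit).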

When I try it without Unicode filenames, no problems occur.

Example filenames:
/dir/Kopiâren.htm
/dir/tëst.gif

The stacktrace:

Traceback (most recent call last):
  File "/srv/tu3/venv/lib/python2.7/site-packages/pybagit/multichecksum.py", line 111, in <module>
    write_manifest(args[0], ENCODING)
  File "/srv/tu3/venv/lib/python2.7/site-packages/pybagit/multichecksum.py", line 54, in write_manifest
    mfile.write("{0} {1}\n".format(csum, fl))
  File "/srv/tu3/venv/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/srv/tu3/venv/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 59: ordinal not in range(128)
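The traceback above is the usual Python 2 pattern of a byte string being handed to an encoding writer, which first decodes it implicitly with the ASCII codec. The sketch below reproduces that decode step in Python 3 terms. It rests on an assumption the thread does not confirm: that the filename was stored in decomposed (NFD) form, which would explain the 0xcc byte. `implicit_ascii_decode` is a hypothetical helper, and the checksum is a placeholder.

```python
import unicodedata

# Assumption: the name on disk is in NFD form, i.e. "a" followed by a
# combining circumflex (U+0302), which UTF-8 encodes as 0xcc 0x82.
nfd_name = unicodedata.normalize("NFD", "Kopi\u00e2ren.htm")
raw = nfd_name.encode("utf-8")


def implicit_ascii_decode(data):
    """Mimic Python 2 silently decoding a byte string with the ASCII codec."""
    return data.decode("ascii")


try:
    implicit_ascii_decode(raw)
    failure = None
except UnicodeDecodeError as exc:
    failure = exc  # 'ascii' codec can't decode byte 0xcc ...

# Decoding explicitly with the real encoding before formatting avoids the error.
decoded = raw.decode("utf-8")
placeholder_checksum = "0" * 32  # not a real checksum
manifest_line = "{0} {1}\n".format(placeholder_checksum, decoded)
```

This only shows one common way to avoid the error (decode filenames explicitly before writing them); it is not necessarily what the fix in the repository does.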

ahankinson commented 9 years ago

Thanks Roel! I will look into this today.

RoelWKramer commented 9 years ago

Small update: the files are actually included in the bag. It looks like multichecksum.py fails.

ahankinson commented 9 years ago

Commit 978196e89a4459bf622e1c8846a962add568b18f should fix this. I've also added some tests to check if it's a problem.

Could you please let me know if it solves your problem?

RoelWKramer commented 9 years ago

It might be fixed, but I can't install it using pip. I get a Unicode encode error because of the directory names.

I tried this: pip install https://github.com/ahankinson/pybagit/archive/master.zip

This is the stacktrace:

Exception:
Traceback (most recent call last):
  File "/srv/tu3/venv/lib/python2.7/site-packages/pip/basecommand.py", line 122, in main
    status = self.run(options, args)
  File "/srv/tu3/venv/lib/python2.7/site-packages/pip/commands/install.py", line 278, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/srv/tu3/venv/lib/python2.7/site-packages/pip/req.py", line 1197, in prepare_files
    do_download,
  File "/srv/tu3/venv/lib/python2.7/site-packages/pip/req.py", line 1375, in unpack_url
    self.session,
  File "/srv/tu3/venv/lib/python2.7/site-packages/pip/download.py", line 582, in unpack_http_url
    unpack_file(temp_location, location, content_type, link)
  File "/srv/tu3/venv/lib/python2.7/site-packages/pip/util.py", line 621, in unpack_file
    unzip_file(filename, location, flatten=not filename.endswith(('.pybundle', '.whl')))
  File "/srv/tu3/venv/lib/python2.7/site-packages/pip/util.py", line 492, in unzip_file
    leading = has_leading_dir(zip.namelist()) and flatten
  File "/srv/tu3/venv/lib/python2.7/site-packages/pip/util.py", line 232, in has_leading_dir
    prefix, rest = split_leading_dir(path)
  File "/srv/tu3/venv/lib/python2.7/site-packages/pip/util.py", line 216, in split_leading_dir
    path = str(path)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 37: ordinal not in range(128)

Storing debug log for failure in /home/vagrant/.pip/pip.log

I think it is because of these paths:
pybagit-master/test/testbag/data/Kopiâren.htm
pybagit-master/test/testbag/data/tëst.gif
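The pip failure above is the mirror image of the earlier one: in Python 2, calling `str()` on a unicode path encodes it with the ASCII codec. A rough Python 3 sketch of that step follows, using the archive path quoted in this comment with a precomposed "â" (U+00E2, matching the `u'\xe2'` in the error). `to_ascii_bytes` is a hypothetical helper, not pip's code.

```python
# Archive path from the comment above, with a precomposed U+00E2.
archive_path = "pybagit-master/test/testbag/data/Kopi\u00e2ren.htm"


def to_ascii_bytes(path):
    """Mimic Python 2's str(unicode_path), which encodes with ASCII."""
    return path.encode("ascii")


try:
    to_ascii_bytes(archive_path)
    error = None
except UnicodeEncodeError as exc:
    error = exc  # points at the first non-ASCII character

# Encoding with UTF-8 instead handles the same path without error.
utf8_bytes = archive_path.encode("utf-8")
```

The offending character sits at index 37 of this path, which lines up with "position 37" in the reported error.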

wolph commented 8 years ago

I can confirm that this issue has been resolved. @RoelWKramer is actually an (ex-)colleague of mine so I know it works for the current codebase now :)

ahankinson commented 8 years ago

Ok, great. Thanks!