LibraryOfCongress / bagit-python

Work with BagIt packages from Python.
http://libraryofcongress.github.io/bagit-python

Special character in Metadata #99

Closed TomZastrow closed 6 years ago

TomZastrow commented 6 years ago

When adding metadata with special characters, I got an error:

No handlers could be found for logger "bagit"
Traceback (most recent call last):
  File "./housekeeping.py", line 57, in <module>
    bag = bagit.make_bag(ship + folder, metadataContainer)
  File "/usr/local/lib/python2.7/dist-packages/bagit.py", line 146, in make_bag
    _make_tag_file('bag-info.txt', bag_info)
  File "/usr/local/lib/python2.7/dist-packages/bagit.py", line 721, in _make_tag_file
    f.write("%s: %s\n" % (h, txt))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0142' in position 84: ordinal not in range(128)

The error seems to come from this function:

def _make_tag_file(bag_info_path, bag_info):
    headers = list(bag_info.keys())
    headers.sort()
    with open(bag_info_path, 'w') as f:
        for h in headers:
            if isinstance(bag_info[h], list):
                for val in bag_info[h]:
                    f.write("%s: %s\n" % (h, val))
            else:
                txt = bag_info[h]
                # strip CR, LF and CRLF so they don't mess up the tag file
                txt = re.sub(r'\n|\r|(\r\n)', '', txt)
                f.write("%s: %s\n" % (h, txt))

Maybe the file could be opened for writing Unicode: with open(bag_info_path, 'w', encoding='utf-8') as f:
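The traceback above is what Python 2 produces when a unicode string is written to a file opened in byte mode: the text is implicitly encoded as ASCII. A minimal sketch of that encoding step in isolation follows. The sample value is made up for illustration; only the character u'\u0142' comes from the traceback.

```python
# Hypothetical metadata value containing U+0142, the character named in the traceback.
value = u"Micha\u0142"

try:
    # This is the implicit step that fails: ASCII cannot represent U+0142.
    value.encode("ascii")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    # The exception records which slice of the input could not be encoded.
    bad_char = exc.object[exc.start:exc.end]

# An explicit UTF-8 encoding handles the same value without error.
encoded = value.encode("utf-8")
```

The same value encodes cleanly once the codec is chosen explicitly, which is why opening the file with a UTF-8 encoding looks like a plausible fix.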

TomZastrow commented 6 years ago

It could be solved this way:

import io
...
with io.open(bag_info_path, 'w', encoding='utf-8') as f:
...
f.write("%s: %s\n" % (unicode(h), unicode(val)))  # in both f.write calls
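Put together, the suggestion above might look like the following sketch. It is not bagit's actual implementation: make_tag_file and text_type are hypothetical names chosen here, and unlike the quoted function it strips line breaks from list values too. It assumes the metadata values are already text, or plain ASCII byte strings on Python 2.

```python
import io
import re

# unicode on Python 2, str on Python 3 (hypothetical helper name).
text_type = type(u"")


def make_tag_file(bag_info_path, bag_info):
    """Write "Header: value" lines as UTF-8, one line per value."""
    with io.open(bag_info_path, "w", encoding="utf-8") as f:
        for h in sorted(bag_info):
            values = bag_info[h]
            if not isinstance(values, list):
                values = [values]
            for val in values:
                # Strip CR, LF and CRLF so they don't break the tag file.
                txt = re.sub(r"\r\n|\r|\n", "", text_type(val))
                f.write(u"%s: %s\n" % (h, txt))
```

Using io.open with an explicit encoding makes the file object accept text and perform the UTF-8 encoding itself, so no implicit ASCII conversion takes place.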

acdha commented 6 years ago

Interesting. Do you have a test case against the latest version of bagit that reproduces the failure? I added one to the test suite, but it isn't failing, so I wonder whether the problem is that your code had Unicode data in a regular byte string.

This changed almost a year ago when #55 was merged and now all of the text I/O uses a simple open_text_file wrapper for codecs.open:

https://github.com/LibraryOfCongress/bagit-python/blob/master/bagit.py#L129

The source quoted above still has _make_tag_file using with open(bag_info_path, 'w') as f: rather than open_text_file; compare the current version:

https://github.com/LibraryOfCongress/bagit-python/blob/master/bagit.py#L1098
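If the metadata does arrive as byte strings, one way to rule that out is to decode it to text before building the metadata dictionary. The sketch below assumes the bytes are UTF-8; to_text is a hypothetical helper, not part of bagit.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings with the given encoding."""
    # On Python 2, bytes is an alias for str, so this catches regular byte strings.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical metadata value: the UTF-8 bytes for "Micha" followed by U+0142.
raw = b"Micha\xc5\x82"
decoded = to_text(raw)
```

Values that are already text pass through unchanged, so the helper can be applied to every metadata value regardless of where it came from.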

TomZastrow commented 6 years ago

Sorry, it was a problem with the encoding of my Python source file. Adding a

# -*- coding: utf-8 -*-

did the trick.

Thanks again.