edsu / pymarc

process MARC records from Python
http://python.org/pypi/pymarc

Trouble reading Harvard Open Metadata MARC files (UTF-8 related?) #89

Open viking2917 opened 8 years ago

viking2917 commented 8 years ago

I am trying to use pymarc to read the Harvard Open Metadata MARC files.

Most of the files process ok but some (for example ab.bib.14.20160401.full.mrc) produce errors when processing. The error I am getting is:

Traceback (most recent call last):
  File "domark.py", line 21, in <module>
    for record in reader:
  File "/Library/Python/2.7/site-packages/six.py", line 535, in next
    return type(self).__next__(self)
  File "/Users/markwatkins/Sites/pharvard/pymarc/reader.py", line 97, in __next__
    utf8_handling=self.utf8_handling)
  File "/Users/markwatkins/Sites/pharvard/pymarc/record.py", line 74, in __init__
    utf8_handling=utf8_handling)
  File "/Users/markwatkins/Sites/pharvard/pymarc/record.py", line 307, in decode_marc
    code = subfield[0:1].decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

The driver code I am using is:

#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys
from pymarc import MARCReader

UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)

if len(sys.argv) >= 2:
    files = [sys.argv[1]]

for file in files:
    with open(file, 'rb') as fh:
        reader = MARCReader(fh, utf8_handling='ignore')
        for record in reader:
#            print "%s by %s" % (record.title(), record.author())
            print(record.as_json())

Other MARC processing tools (e.g. MarcEdit) seem to process the file with no issues, so I think the file is legitimate.

Am I doing something wrong? Is there an issue with pymarc, possibly UTF-8 processing related?

Wooble commented 8 years ago

Can you isolate a single record that's displaying this problem?

From the traceback it appears that there's a subfield code that's not ASCII, which is forbidden by the MARC21 spec.

(If such records exist in the wild, though, pymarc should probably have a way to deal with them. This is one area where there's currently no workaround as far as I know...)
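One way to hunt for the offending record without running the whole file through pymarc is to scan the raw bytes directly: ISO 2709 records end with the record terminator 0x1D, and every subfield code is the single byte following the 0x1F delimiter. A stdlib-only Python 3 sketch (the function name is made up for illustration):

```python
def find_nonascii_subfield_codes(raw):
    """Scan raw MARC21 bytes and yield (record_index, code_byte) for
    every subfield code that is not 7-bit ASCII.

    Records are split on the ISO 2709 record terminator (0x1D); each
    subfield is introduced by the delimiter 0x1F, and the byte after
    it is the data element identifier, which MARC21 requires to be
    ASCII.
    """
    for i, chunk in enumerate(raw.split(b'\x1d')):
        if not chunk.strip():
            continue  # trailing empty chunk after the last terminator
        pos = chunk.find(b'\x1f')
        while pos != -1:
            code = chunk[pos + 1:pos + 2]
            if code and code[0] > 0x7f:
                yield i, code
            pos = chunk.find(b'\x1f', pos + 1)
```

Run over `ab.bib.14.20160401.full.mrc`, this should point at the record carrying the `\xc4` byte from the traceback.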

viking2917 commented 8 years ago

Thank you! Working on isolating the record... unfortunately it's a massive binary file. :( But from the Python debugger, it does look like you are correct that there are Unicode chars in a subfield, e.g.

(Pdb) entry_tag
u'040'
(Pdb) subs
[u'  ', '\xc4\x81TRCLS', 'beng', 'erda', 'cTRCLS', 'dOCLCO', 'dHMY']
viking2917 commented 8 years ago

Here's what the record looks like dumped to a text file by MarcEdit. It does indeed look like the 040 field has a non-ASCII character in the subfield code.

If the MARC21 spec indeed forbids this, then this issue should probably be closed, although more tolerant error handling might be helpful.

=LDR  01444cam a2200397Ii 4500
=001  014333604-5
=005  20150806114915.0
=008  150209s2015\\\\ja\a\\\\\\\\\\000\0\jpn\d
=020  \\$a9784480068163
=020  \\$a4480068163
=035  0\$aocn902996729
=040  \\$āTRCLS$beng$erda$cTRCLS$dOCLCO$dHMY
=090  \\$aNA6310$b.A53 2015
=100  1\$6880-01$aAkase, Tatsuzō,$d1946-$eauthor.
=245  10$6880-02$aEki o dezain suru :$bkarā shinsho /$cAkase Tatsuzō.
=264  \1$6880-03$aTōkyō-to Taitō-ku :$bChikuma Shobō,$c2015.
=300  \\$a254 pages :$billustrations ;$c18 cm.
=336  \\$atext$btxt$2rdacontent
=337  \\$aunmediated$bn$2rdamedia
=338  \\$avolume$bnc$2rdacarrier
=490  1\$6880-04$aChikuma shinsho ;$v1112
=650  \0$aRailroad stations$xDesign and construction.
=650  \0$aRailroad stations$vDesigns and plans.
=650  07$6880-05$aEki.$2jlabsh/4
=650  07$6880-06$aShinboru māku.$2jlabsh/4
=880  1\$6100-01$a赤瀬達三,$d1946-$eauthor.
=880  10$6245-02$a駅をデザインする :$bカラー新書 /$c赤瀬達三.
=880  \1$6264-03$a東京都台東区 :$b筑摩書房,$c2015.
=880  1\$6490-04$aちくま新書 ;$v1112
=880  07$6650-05$a駅.$2jlabsh/4
=880  07$6650-06$aシンボルマーク.$2jlabsh/4
=830  \0$6880-07$aChikuma shinsho ;$v1112.
=880  \0$6830-07$aちくま新書 ;$v1112.
=988  \\$a20150327
=049  \\$aHMYY
=906  \\$0MH
edsu commented 8 years ago

It looks like the record is coded as containing Unicode (leader position 9). I forget, why are you using

reader = MARCReader(fh, utf8_handling='ignore')
viking2917 commented 8 years ago

I was using 'ignore' because otherwise the processing would stop when it encountered encoding difficulties. (Aside: I wonder if better error handling would be to skip offending records and keep going? Right now, when it throws an exception, the script halts.) (Caveat: I am a Python newbie and likely doing something wrong.)

gugek commented 8 years ago

@viking2917: right, sadly the ignore parameter doesn't ignore everywhere in the MARC field, and also doesn't ignore in areas where UTF-8 isn't permitted.

ALEPH (the ILS Harvard is on) will let you save a unicode character to a subfield.

I have a branch somewhere that does some of this error handling for another project.

In record.py:

# excerpt from decode_marc() in record.py; assumes `import logging`
# and `from unidecode import unidecode` at the top of the module
for subfield in subs[1:]:
    if len(subfield) == 0:
        continue
    try:
        code = subfield[0:1].decode('ascii')
    except UnicodeDecodeError:
        if utf8_handling == 'replace':
            code = unidecode(subfield[0:1].decode(encoding, utf8_handling))
            message = "tag {0}: utf8 - sf code {1}".format(entry_tag, code)
            if self['001']:
                message = "=001 {0}: ".format(self['001'].data) + message
            logging.error(message)
        else:
            raise

https://github.com/gugek/pymarc/blob/leader-handling/pymarc/record.py

viking2917 commented 8 years ago

@gugek Thank you! Will give that a go.

Wooble commented 8 years ago

That sounds like a reasonable patch to me. (Of course, it wouldn't have helped here with utf8_handling='ignore'; I can't think of a good way to support 'ignore' for this, since it implies ending up with the subfield code completely blank.)

Decomposing the "ā" and throwing away diacritics would probably do the right thing in this particular case, since I can't see how that's possibly supposed to be anything but $a... But I don't know that's a good general solution :)
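The decomposition idea works out in this particular case: the stray bytes `\xc4\x81` are the UTF-8 encoding of 'ā', and NFD normalization splits that into a base 'a' plus a combining macron. A quick illustration:

```python
import unicodedata

# b'\xc4\x81' is the UTF-8 encoding of U+0101 ('ā'), the bytes seen
# in the pdb output above
code = b'\xc4\x81'.decode('utf-8')

# NFD decomposition separates the base letter from its diacritic
decomposed = unicodedata.normalize('NFD', code)
base, macron = decomposed[0], decomposed[1]
# base is 'a' and macron is the combining macron U+0304, so throwing
# away the diacritic recovers the intended subfield code $a
```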

viking2917 commented 8 years ago

Yes, 'replace' is probably the right option - I was happy to simply discard records with errors so was using 'ignore', but with 'replace' and this patch, I seem to be able to get pretty much everything.

Thanks everyone!

viking2917 commented 8 years ago

I am not sure how aggressively the Harvard Open Metadata project is being maintained, but I've reported the issue in case someone there is actively maintaining it. (It is a treasure trove of open source book metadata....)

Thanks.

Mark

On Mon, Apr 25, 2016 at 4:44 PM, Jim Nicholls notifications@github.com wrote:

Not that this helps anyone usefully progress, but I did want to point out that this is in fact not a valid MARC record.

According to the MARC 21 Specification https://www.loc.gov/marc/specifications/specrecstruc.html:

subfield code "The two-character combination of a delimiter followed by a data element identifier. [...]"

delimiter "ASCII control character 1F(hex) [...]"

data element identifier "A one-character code used to identify individual data elements within a variable field. The data element may be any ASCII lowercase alphabetic, numeric, or graphic symbol except blank."

And according to the Character Sets and Encoding Options https://www.loc.gov/marc/specifications/speccharintro.html section:

ASCII "[...] a 7-bit coded character set [...]"

ASCII numerics "ASCII code points 30(hex) through 39(hex)"

ASCII lowercase alphabetics "ASCII code points 61(hex) through 6F(hex) and 70(hex) through 7A(hex)"

ASCII graphic symbols "The ASCII graphic characters other than numerics, alphabetics, space, and delete. Code points 21(hex) through 2F(hex), 3A(hex) through 3F(hex), 40(hex), 5B(hex) through 5F(hex), 60(hex), and 7B(hex) through 7E(hex) are included."
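The code-point ranges quoted from the spec can be folded into a small validity check. This is just an illustration of the quoted rules; `is_valid_identifier` is a made-up name, not part of pymarc:

```python
def is_valid_identifier(ch):
    """Return True if ch (a length-1 bytes or str) is a legal MARC 21
    data element identifier: an ASCII lowercase alphabetic, numeric,
    or graphic symbol, per the ranges quoted from the spec."""
    b = ch[0] if isinstance(ch, (bytes, bytearray)) else ord(ch)
    return (0x61 <= b <= 0x7a       # lowercase alphabetics
            or 0x30 <= b <= 0x39    # numerics
            or 0x21 <= b <= 0x2f    # graphic symbols:
            or 0x3a <= b <= 0x3f
            or b == 0x40
            or 0x5b <= b <= 0x5f
            or b == 0x60
            or 0x7b <= b <= 0x7e)
```

By these rules the `\xc4` byte in the 040 field above is invalid, as are blank and uppercase letters.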


josepablog commented 8 years ago

I know this is an old issue, but I'm having the same problem as @viking2917 ...

I'm trying to parse the Harvard Open Metadata db, and I'm running into exceptions. Is there a patch I could apply? My code breaks and I cannot catch the exception to skip the record.

(I do not understand what @gugek suggested)...

Wooble commented 8 years ago

Can you isolate a record that displays the issue and attach it as MARC21? (I'd prefer to avoid trying to manually create a bad record by hand for testing, and I don't think my ILS can create one nor can I do it programmatically in pymarc :) )

viking2917 commented 8 years ago

@josepablog I did finally get around this problem. Here's how I got around it:

First I altered record.py (new file attached), to add some error handling. record.py.txt

Instead of this driver from the github page:

from pymarc import MARCReader
with open('test/marc.dat', 'rb') as fh:
    reader = MARCReader(fh)
    for record in reader:
        print(record.title())

I changed the driver to this (basically, importing codecs and sys):

#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys
from pymarc import MARCReader

UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)

if len(sys.argv) >= 2:
    files = [sys.argv[1]]

for file in files:
    with open(file, 'rb') as fh:
        reader = MARCReader(fh, utf8_handling='replace')
        for record in reader:
            print(record.as_json())

On my Mac, I needed to install a few Python packages:

sudo pip install unidecode
sudo pip install -U six

(I am not 100% sure if I needed to install six or not. Your mileage may vary).

I'm a python n00b and not sure this code is really production-ready, so I did not create a pull request. But it's been working for me. I only half-understand what I did, as I really don't know Python. Good luck!

josepablog commented 8 years ago

@Wooble the file is huge! And my understanding of MARC 21 is extremely limited.

I'll give @viking2917's solution a try, and hopefully it will work...

Thank you to both!

viking2917 commented 8 years ago

(Aside: I traded some emails with the good folks at Harvard, and they said something to the effect that there are in fact occasional invalid records due to the large, distributed nature of their libraries and data. They did correct the issues I brought to their attention, but I think it's a good idea to protect against invalid data where possible.)

josepablog commented 8 years ago

@viking2917 I downloaded the new Harvard db, and I don't have those problems if I use the utf8_handling='ignore' flag ...

viking2917 commented 8 years ago

@josepablog Interesting. That flag helped me get further but didn't solve all my issues. But glad it's working for you! Perhaps something has changed in the meantime....

josepablog commented 8 years ago

I think I declared victory too early.

PyMarc still breaks for these files:

ab.bib.11.20160805.full.mrc
ab.bib.13.20160805.full.mrc
ab.bib.14.20160805.full.mrc

I wish I knew how to isolate the record, to get some help from @edsu : )

edsu commented 8 years ago

So it sounds like we might need a way to catch all exceptions when reading a record, and keep moving through the records?

josepablog commented 8 years ago

@edsu Yes, I think so

This is an example of the exception I get (I'm using utf8_handling='ignore', which I'm not sure makes sense, but it reduces the number of errors):

File "//anaconda/lib/python2.7/site-packages/pymarc/reader.py", line 97, in __next__
    utf8_handling=self.utf8_handling)
  File "//anaconda/lib/python2.7/site-packages/pymarc/record.py", line 74, in __init__
    utf8_handling=utf8_handling)
  File "//anaconda/lib/python2.7/site-packages/pymarc/record.py", line 231, in decode_marc
    self.leader = marc[0:LEADER_LEN].decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 17: ordinal not in range(128)

Should I just wrap the whole thing in a catch? Or is there anything smarter to do?

Thank you again for your help, Ed!

Wooble commented 8 years ago

It's fairly annoying to have to do it yourself, since it would require calling next manually instead of just using a for loop.

I don't think changing utf8_handling is likely to help if the problem is in the indicators or the leader. pymarc itself should probably have a way to recover from these better; personally I'm running my fork's "leader_encoding" branch in production because our database itself has broken records. It's probably a good start but not really something I'd want to merge with master at the moment since it's a bit sledgehammery; it just fixes the leaders we have problems with in the obvious way and prints a warning, with no way to select strict behavior.
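The manual-next pattern can be wrapped in a small generic generator. This is a sketch, not a pymarc API, and it carries a caveat: after an exception, MARCReader may not be left positioned at the start of the next record, so this only helps when the reader recovers its place cleanly.

```python
def skip_errors(iterator, on_error=None):
    """Drain any iterator, swallowing exceptions raised by individual
    next() calls; StopIteration still ends the loop normally."""
    while True:
        try:
            yield next(iterator)
        except StopIteration:
            return
        except Exception as exc:  # deliberately broad for bad records
            if on_error is not None:
                on_error(exc)
```

With pymarc this would look something like `for record in skip_errors(MARCReader(fh, utf8_handling='replace'), print): ...`.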

pbnjay commented 6 years ago

Also having this issue with USDA National Agricultural Library's marc21 downloads. Pymarc 2.9.2 handled these files fine on an old system.

I can share a ~60mb file from this distribution if that helps test issues.

edsu commented 6 years ago

@pbnjay Yes, sharing the data that can be used to demonstrate the problem is key.

On another note, I myself work with MARC records only rarely now. Does it make sense to move pymarc over to the code4lib organization account here on GitHub so it can be maintained/developed without me being a bottleneck?

Wooble commented 6 years ago

(If you can isolate a single problem record instead of a 60MB file that would probably be better, though)

anarchivist commented 6 years ago

In case it's useful, I previously had bodged together a permissive reader version of pymarc.MARCReader: https://gist.github.com/anarchivist/4141681

reeset commented 6 years ago

I'm not sure how Python does character reading, but for large sets like this, I'd always recommend taking them out of MARC and putting them into XML. You can't trust the encoding bits, and most ILS systems will do wonky things when exporting large sets (that you won't see with small sets). Additionally, they violate MARC21 rules (but not the general ISO 2709 record structure), so you cannot code rules based on expected field values.

When I process directly to XML (at least in how I do it in MarcEdit), I ignore character encoding completely, processing via a binary memory stream, and sanitize characters for XML processing. This way I avoid these kinds of character issues.

The other option (MarcEdit does this as well) when doing MARC processing is to have your stream do encoding swapping based on the record read -- but that requires an algorithm that actually determines the character encoding of each record, so you can conform the reader to the individual record's encoding block and then convert the data into the encoding expected by the writer.
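The "sanitize characters for XML processing" step can be sketched with a stdlib regex; this is an illustration of the idea, not MarcEdit's actual implementation. Conveniently, MARC's structural delimiters (0x1D, 0x1E, 0x1F) fall squarely inside the control range that XML 1.0 forbids:

```python
import re

# Control characters that XML 1.0 does not allow anywhere in a
# document: everything below 0x20 except tab, LF, and CR. This range
# covers MARC's record terminator (0x1D), field terminator (0x1E),
# and subfield delimiter (0x1F).
_XML_ILLEGAL = re.compile('[\x00-\x08\x0b\x0c\x0e-\x1f]')

def sanitize_for_xml(text):
    """Strip characters that would make an XML serializer choke."""
    return _XML_ILLEGAL.sub('', text)
```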

edsu commented 6 years ago

I would still like to have a test record to play with that demonstrates the particular problem we are dealing with here. If we can't reproduce the problem it's really impossible to get enough traction to fix it.

I do like @anarchivist's idea of adding an option to pymarc.Reader. I'm going to open a new issue for that.

pbnjay commented 6 years ago

I essentially just commented out all the instances of .decode('ascii') in record.py and it works fine now (line 307 was my particular exception also, but I just did them all).

I uploaded two problem files here: https://www.dropbox.com/sh/f4w7nv6e5ghnpmr/AACXD4L-GGqPhbc1YexBc6iea?dl=0 Since they're from the USDA they should be public domain, but just in case I'll unshare them once you have a copy to debug with. I'm not producing these files, just converting them to XML, so it'll probably be easier for someone else who knows what they're doing to isolate them.