gugek closed this pull request 8 years ago
LGTM. It might be nice to eventually support the ligatures too, but a small patch that adds characters actually being seen in a real user's real records is preferable as a first step anyway, especially if adding everything would touch a lot more moving parts.
I sort of feel like the whole MARC-8 support should be ripped out and made into a real codec, but I also have no motivation to do that myself: we've been all-Unicode for, IIRC, at least 10 years, and I've personally never touched a MARC-8 record with either pymarc or MARC::Record in production code.
@wooble: thanks for the comment. Yes, I agree. That said, I suspect we're going to be stuck with MARC-8 in a lot of workflows for a while: OCLC still provides MARC-8 only for some of their pipelines (I hear that will change), and other vendors may only supply MARC-8. I don't know how to implement a new encoding for the codecs module, but will look into it later.
I looked at the MARC4J implementation: there, the existing Library of Congress XML code table is processed into a lookup table, and a number of machinations then handle all the ugly MARC out there. So no codec/encoding implementation per se there either.
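The code-table-to-lookup-table step above is straightforward to sketch. The element names (`<code>`, `<marc>`, `<ucs>`) below follow the LoC codetables XML layout as I recall it; treat the exact structure as an assumption, and the sample document here is an inline stand-in for the real file.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for a fragment of the LoC code table XML.
SAMPLE = """<codeTable>
  <code><marc>C7</marc><ucs>00DF</ucs></code>
  <code><marc>C8</marc><ucs>20AC</ucs></code>
</codeTable>"""

def build_table(xml_text):
    """Build a MARC-8 byte -> Unicode character lookup table."""
    table = {}
    for code in ET.fromstring(xml_text).iter("code"):
        marc = int(code.findtext("marc"), 16)   # MARC-8 value, hex
        ucs = code.findtext("ucs")              # Unicode code point, hex
        if ucs:  # some entries have no Unicode equivalent
            table[marc] = chr(int(ucs, 16))
    return table

table = build_table(SAMPLE)
```

Regenerating the table from the published XML, rather than hand-maintaining a Python dict, is arguably the main thing worth borrowing from MARC4J.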
A codecs implementation was discussed in: https://github.com/edsu/pymarc/issues/7.
Pull request for issue: https://github.com/edsu/pymarc/issues/84
Added mappings for eszett and the euro sign. Updated the alif character mapping to a new code point. Added relevant tests as well.
I made a decision not to deal with the ligatures and double tildes. These are double combining characters that are represented with two code points in MARC-8 but one in Unicode. The alternative is to map them to the left and right half combining characters, which is what exists right now. Otherwise I'd have to get into the guts of marc8.py and add something to handle all the error conditions: see MARBI Proposal 2004-08 and its discussion paper.
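To make the trade-off concrete, here is a small sketch of the two Unicode forms for a ligature spanning two letters, assuming the half forms are U+FE20/U+FE21 (combining ligature left/right half) and the single-codepoint form is U+0361 (combining double inverted breve), as discussed in MARBI 2004-08.

```python
import unicodedata

# Current behavior: each MARC-8 half maps to its own combining half.
halves = "t\uFE20u\uFE21"   # t + left half, u + right half

# Single-codepoint alternative: one character spans both letters.
single = "t\u0361u"

# The two forms render similarly but are NOT canonically equivalent,
# so normalization will not reconcile them; a converter has to pick
# one form and handle malformed input (e.g. an unpaired half) itself.
print(unicodedata.normalize("NFC", halves) == unicodedata.normalize("NFC", single))
```

That lack of canonical equivalence is exactly why folding two MARC-8 code points into one Unicode character needs the error handling alluded to above, and why mapping to the half forms is the simpler first step.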
From: https://memory.loc.gov/diglib/codetables/45.html