edsu / pymarc

process MARC records from Python
http://python.org/pypi/pymarc

Added eszett and euro sign; revised alif to MARC-8 to UTF8 conversion #85

Closed: gugek closed this 8 years ago

gugek commented 8 years ago

Pull request for issue: https://github.com/edsu/pymarc/issues/84

Added mappings for the eszett and the euro sign, and updated the alif mapping to a new character. Relevant tests are included.
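For readers who have not seen a MARC-8 mapping table, here is a minimal sketch of what adding two single-byte mappings amounts to. It is not the pymarc code: `SKETCH_MAPPING` and `decode_sketch` are hypothetical names, and the byte values 0xC7 (eszett) and 0xC8 (euro sign) are assumptions taken from the Library of Congress code tables, not from this thread. The alif is left out because the thread does not say which code point it was changed to.

```python
import unicodedata

# Assumed MARC-8 (ANSEL) byte values; check them against the LoC code tables.
SKETCH_MAPPING = {
    0xC7: "\u00df",  # eszett (LATIN SMALL LETTER SHARP S)
    0xC8: "\u20ac",  # euro sign
}


def decode_sketch(data: bytes) -> str:
    """Decode ASCII plus the two bytes above; any other byte becomes U+FFFD."""
    out = []
    for byte in data:
        if byte < 0x80:
            out.append(chr(byte))
        else:
            out.append(SKETCH_MAPPING.get(byte, "\ufffd"))
    return "".join(out)


if __name__ == "__main__":
    for byte, char in SKETCH_MAPPING.items():
        print(hex(byte), unicodedata.name(char))
    print(decode_sketch(b"Stra\xc7e, 5 \xc8"))
```

A real converter also has to handle escape sequences for other character sets and reorder combining characters, none of which this sketch attempts.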

I decided not to deal with the ligatures and double tildes. These are double combining characters that are represented with two code points in MARC-8 and one in Unicode. The alternative is to map them to the left-half and right-half combining characters, which is what the code does right now. Otherwise I'd have to get into the guts of marc8.py and add something that handles all the error conditions: see MARBI Proposal 2004-08 and the discussion paper.

From: https://memory.loc.gov/diglib/codetables/45.html

Note 1: The Ligature that spans two characters is constructed of two halves in MARC-8: EB (Ligature, first half) and EC (Ligature, second half). The preferred Unicode/UTF-8 mapping is to the single character Ligature that spans two characters, U+0361. The single character Ligature is encoded between the two characters to be spanned. The two half Ligatures in Unicode, to which the Ligature has been mapped since 1996, are indicated in the mapping as alternatives, but their use is not recommended. It is expected that font support for the single character Ligature mark will be more easily obtained than for the two halves.

Note 2: The Double Tilde that spans two characters is constructed of two halves in MARC-8: FA (Double Tilde, first half) and FB (Double Tilde, second half). The preferred Unicode/UTF-8 mapping is to the single character Double Tilde that spans two characters, U+0360. The single character Double Tilde is encoded between the two characters to be spanned. The two half Double Tildes in Unicode, to which the MARC8 Double Tilde has been mapped since 1996, are indicated in the mapping as alternatives, but their use is not recommended. It is expected that font support for the single character Double Tilde mark will be more easily obtained than for the two halves.
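To make the two notes concrete, here is a rough sketch of the difference between the half-mark mapping and the preferred single-character mapping, working on text that has already been converted to Unicode. It assumes the half marks are U+FE20/U+FE21 (ligature) and U+FE22/U+FE23 (double tilde), each placed directly after the character it modifies. `prefer_spanning_marks` is a hypothetical helper, not part of pymarc, and it deliberately leaves unpaired halves untouched, which is exactly the kind of error condition mentioned above.

```python
import re

# (left half, right half) -> single spanning mark preferred by the LoC notes
HALF_PAIRS = {
    ("\ufe20", "\ufe21"): "\u0361",  # ligature halves -> COMBINING DOUBLE INVERTED BREVE
    ("\ufe22", "\ufe23"): "\u0360",  # double tilde halves -> COMBINING DOUBLE TILDE
}


def prefer_spanning_marks(text: str) -> str:
    """Rewrite base+left, base+right pairs as base, spanning mark, base."""
    for (left, right), spanning in HALF_PAIRS.items():
        # One character, the left half, one character, the right half.
        pattern = re.compile("(.)" + left + "(.)" + right, re.DOTALL)
        text = pattern.sub(lambda m: m.group(1) + spanning + m.group(2), text)
    return text


if __name__ == "__main__":
    halves = "t\ufe20s\ufe21"
    print([hex(ord(c)) for c in prefer_spanning_marks(halves)])
```

The sketch does not cover a spanned character that carries additional combining marks of its own, nor malformed input where only one half is present.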

Wooble commented 8 years ago

LGTM. It might be nice to eventually also support the ligatures, but a nice small patch that adds characters that are actually being seen in the real records of a real user is preferable as a first step anyway, if adding in everything is going to have to touch a lot more moving parts.

I sort of feel like the whole MARC-8 support should be ripped out and made into a real codec, but I also have no motivation to do that myself since we've been all-unicode for, IIRC, at least 10 years and I've personally never touched a MARC-8 record with either pymarc or MARC::Record in production code.

gugek commented 8 years ago

@wooble: thanks for the comment. Yes, I agree, though I'd suggest that we're going to be stuck with MARC-8 for a lot of processes for a while. OCLC still provides MARC-8 only for some of their pipelines (I hear that will change), and other vendors might only have MARC-8. I don't know how to implement a new encoding for codecs, but I will look into it later.

I looked at the MARC4J implementation: there, the existing Library of Congress XML code table is processed into a lookup table. There are then a number of machinations to handle all the ugly MARC out there. So there is no codec/encoding implementation per se.
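As an illustration of the "XML code table to lookup table" idea, here is a small Python sketch. It is not MARC4J's code (which is Java), and the element names in `SAMPLE_XML` (`code`, `marc`, `ucs`, `name`) are assumptions made for this example; the real Library of Congress file should be checked for its actual structure. `build_lookup` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a code table file.
SAMPLE_XML = """\
<codeTable>
  <code><marc>C7</marc><ucs>00DF</ucs><name>ESZETT SYMBOL</name></code>
  <code><marc>C8</marc><ucs>20AC</ucs><name>EURO SIGN</name></code>
  <code><marc>FF</marc><name>ENTRY WITHOUT A MAPPING</name></code>
</codeTable>
"""


def build_lookup(xml_text: str) -> dict:
    """Return {MARC-8 byte value: Unicode character} for every complete entry."""
    table = {}
    root = ET.fromstring(xml_text)
    for code in root.iter("code"):
        marc = code.findtext("marc")
        ucs = code.findtext("ucs")
        if not marc or not ucs:
            continue  # skip entries that lack either side of the mapping
        table[int(marc, 16)] = chr(int(ucs, 16))
    return table


if __name__ == "__main__":
    print(build_lookup(SAMPLE_XML))
```

Generating the table from the published file, rather than maintaining it by hand, would make additions like the eszett and euro sign a data update instead of a code change.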

gugek commented 8 years ago

A codecs implementation was discussed in: https://github.com/edsu/pymarc/issues/7.
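For anyone wondering what a "real codec" would involve, here is a minimal sketch using the standard library's `codecs.register` and `codecs.CodecInfo`. It is a toy under stated assumptions, not a MARC-8 implementation: the codec name `marc8sketch` is made up, the table holds only two assumed byte values (0xC7 eszett, 0xC8 euro sign), and there is no handling of escape sequences or combining characters. Encoding is left unimplemented.

```python
import codecs

CODEC_NAME = "marc8sketch"  # hypothetical name, chosen for this example

# Assumed byte values; a real codec would load the full LoC code tables.
_TABLE = {0xC7: "\u00df", 0xC8: "\u20ac"}


def _decode(data, errors="strict"):
    """Stateless decoder: returns (text, number of bytes consumed)."""
    data = bytes(data)
    chars = []
    for pos, byte in enumerate(data):
        if byte < 0x80:
            chars.append(chr(byte))
        elif byte in _TABLE:
            chars.append(_TABLE[byte])
        elif errors == "replace":
            chars.append("\ufffd")
        elif errors == "ignore":
            continue
        else:
            raise UnicodeDecodeError(CODEC_NAME, data, pos, pos + 1, "unmapped byte")
    return "".join(chars), len(data)


def _encode(text, errors="strict"):
    raise NotImplementedError("this sketch only decodes")


def _search(name):
    """Search function for codecs.register: return CodecInfo or None."""
    if name == CODEC_NAME:
        return codecs.CodecInfo(encode=_encode, decode=_decode, name=CODEC_NAME)
    return None


codecs.register(_search)

if __name__ == "__main__":
    print(b"Stra\xc7e, 5 \xc8".decode(CODEC_NAME))
```

Once registered, the codec works anywhere an encoding name is accepted, such as `bytes.decode`. A production version would also need incremental and stream decoders to cope with MARC-8's stateful escape sequences.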

edsu commented 8 years ago

Thanks for this, @gugek. It was just released as v3.1.1.