I came across some records in the wild that contain the eszett (ß) and noticed that the existing marc8_mapping.py doesn't have a mapping for that character (Unicode: U+00DF).
It looks like the LC Code Tables for MARC-8 mappings were updated in 2004 (see https://memory.loc.gov/diglib/codetables/45.html), which might explain why the character (and the Euro sign) were overlooked.
I can provide an updated file in a pull request.
But there are a couple of other changes listed that aren't reflected in the mapping:
See:
Revised June 2004 to add the Eszett (M+C7) and the Euro Sign (M+C8) to the
MARC-8 set.
Revised September 2004 to change the mapping from MARC-8 to Unicode for
the Ligature (M+EB and M+EC) from U+FE20 and U+FE21 to U+0361.
Revised September 2004 to change the mapping from MARC-8 to Unicode for
the Double Tilde (M+FA and M+FB) from U+FE22 and U+FE23 to U+0360.
Revised March 2005 to change the mapping from MARC-8 to Unicode for the
Alif (M+2E) from U+02BE to U+02BC.
So the question is how to handle the revised mappings. Just do the right thing right now? Keep the old behavior? Adding the two new characters is easy enough, but changing the existing mappings might be problematic for anyone who relies on the current output.
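One option would be to make the revised mappings switchable, so that the two additions always apply and the changed mappings can be turned off by anyone who depends on the old output. Below is a rough sketch of that idea, not the actual layout of marc8_mapping.py: the names `ADDED`, `LEGACY`, `REVISED`, and `build_overrides` are hypothetical, the keys are the MARC-8 code points exactly as written in the revision notes above, and U+20AC for the Euro sign comes from Unicode rather than from the notes. The notes list each pair (M+EB/M+EC and M+FA/M+FB) as moving to a single code point; mapping both halves to that code point is a literal reading and an assumption that should be checked against the current LC table.

```python
# Hypothetical sketch; these names are not part of the existing module.

# Characters added in June 2004 (no previous mapping to preserve).
ADDED = {
    0xC7: 0x00DF,  # Eszett
    0xC8: 0x20AC,  # Euro sign
}

# Mappings as they were before the 2004/2005 revisions.
LEGACY = {
    0xEB: 0xFE20,  # Ligature, first half
    0xEC: 0xFE21,  # Ligature, second half
    0xFA: 0xFE22,  # Double tilde, first half
    0xFB: 0xFE23,  # Double tilde, second half
    0x2E: 0x02BE,  # Alif, code point as given in the revision note
}

# Mappings after the revisions (literal reading of the notes; assumption).
REVISED = {
    0xEB: 0x0361,
    0xEC: 0x0361,
    0xFA: 0x0360,
    0xFB: 0x0360,
    0x2E: 0x02BC,
}


def build_overrides(use_revised=True):
    """Return MARC-8 -> Unicode overrides, with or without the revisions."""
    overrides = dict(ADDED)  # the new characters apply either way
    overrides.update(REVISED if use_revised else LEGACY)
    return overrides


if __name__ == "__main__":
    for code, ucs in sorted(build_overrides().items()):
        print(f"M+{code:02X} -> U+{ucs:04X}")
```

With something like this, the default could follow the current LC table while `build_overrides(use_revised=False)` keeps the old behavior for the changed characters.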