edsu / pymarc

process MARC records from Python
http://python.org/pypi/pymarc

MARCReader obscure warning: couldn't find 0xa0 in g0=66 g1=69 #114

Open nemobis opened 6 years ago

nemobis commented 6 years ago

Not all, but a good portion of the records in the associated mrc file, when read, produce the warning "couldn't find 0xa0 in g0=66 g1=69". Is this expected?

>>> record.as_marc()
'01182nam0 22003253i 450 001001100000005001700011010001800028100004100046101000800087102000700095181002000102182001100122200008100133205001700214210003400231215001800265225001000283300003100293300004800324410003200372500004800404676004100452700003700493702004000530790004800570801002800618850001900646950017800665977001300843\x1eMIL0864540\x1e20180302002150.0\x1e  \x1fa9788804642091\x1e  \x1fa20140730d2014    ||||0itac50      ba\x1e| \x1faita\x1e  \x1fait\x1e 1\x1f6z01\x1fai \x1fbxxxe  \x1e 1\x1f6z01\x1fan\x1e1 \x1faRaccolto di sangue\x1fe[thriller]\x1ffSharon Bolton\x1fgtraduzione di Manuela Faimali\x1e  \x1faEd. speciale\x1e  \x1faMilano\x1fcOscar Mondadori\x1fd2014\x1e  \x1fa453 p.\x1fd20 cm\x1e| \x1faOscar\x1e  \x1faIn copertina: Oscar estate\x1e  \x1faA pagina IV di copertina: ebook disponibile\x1e 0\x1f1001CFI0000102\x1f12001 \x1faOscar\x1e10\x1faHThe Iblood harvest\x1f3UBO3836087\x1f9RAVV580629\x1e  \x1fa823.92\x1f9Narrativa inglese. 2000-\x1fv22\x1e 1\x1faBolton\x1fb, S. J.\x1f3RAVV580629\x1f4070\x1e 1\x1faFaimali\x1fb, Manuela\x1f3LO1V356745\x1f4070\x1e 1\x1faBolton\x1fb, Sharon\x1f3CFIV315469\x1fzBolton, S. J.\x1e 3\x1faIT\x1fbIT-000000\x1fc20140730\x1e  \x1faIT-\x1faIT-MI0185\x1e 0\x1faArch. della  Produzione Editoriale della Lombardia\x1fc1 v.\x1fd ELAPE-M     F18                     4698\x1fe ELAPE0001648725  VMN                       1 v.\x1ffB \x1fh20141126\x1fi20141126\x1e  \x1fa EL\x1fa NB\x1e\x1d'
>>> record = reader.next()
>>> record = reader.next()
>>> record = reader.next()
>>> record = reader.next()
>>> record = reader.next()
couldn't find 0xa0 in g0=66 g1=69
couldn't find 0xa0 in g0=66 g1=69
>>> record.as_marc()
'01030nam0 22003013i 450 001001100000005001700011010001800028010001800046100004100064101001300105102000700118181002000125182001100145200004200156210002500198215001800223225001600241300003300257410003800290500004800328517003100376676004200407700004100449801002800490850001900518950017800537977001300715\x1eMIL0864555\x1e20180302002150.0\x1e  \x1fa9788856639339\x1e  \x1fa9788856646948\x1e  \x1fa20140730d2014    ||||0itac50      ba\x1e| \x1faita\x1fcita\x1e  \x1fait\x1e 1\x1f6z01\x1fai \x1fbxxxe  \x1e 1\x1f6z01\x1fan\x1e1 \x1faTutta mia la citt\xa9 \x1ffCarlotta Pistone\x1e  \x1faMilano\x1fcPiemme\x1fd2014\x1e  \x1fa306 p.\x1fd22 cm\x1e| \x1faPiemme voci\x1e  \x1faIn copertina: Milano in love\x1e 0\x1f1001CAG1804037\x1f12001 \x1faPiemme voci\x1e10\x1faTutta mia la citt\xa9 \x1f3LO11530364\x1f9RMLV077939\x1e1 \x1faMilano in love\x1f9BVE0684571\x1e  \x1fa853.92\x1f9Narrativa italiana. 2000-\x1fv22\x1e 1\x1faPistone\x1fb, Carlotta\x1f3RMLV077939\x1f4070\x1e 3\x1faIT\x1fbIT-000000\x1fc20140730\x1e  \x1faIT-\x1faIT-MI0185\x1e 0\x1faArch. della  Produzione Editoriale della Lombardia\x1fc1 v.\x1fd ELAPE-M     F18                     4725\x1fe ELAPE0001649025  VMN                       1 v.\x1ffB \x1fh20141126\x1fi20141126\x1e  \x1fa EL\x1fa NB\x1e\x1d'
>>> record.as_dict()
{'fields': [{'001': u'MIL0864555'}, {'005': u'20180302002150.0'}, {'010': {'ind1': u' ', 'subfields': [{u'a': u'9788856639339'}], 'ind2': u' '}}, {'010': {'ind1': u' ', 'subfields': [{u'a': u'9788856646948'}], 'ind2': u' '}}, {'100': {'ind1': u' ', 'subfields': [{u'a': u'20140730d2014    ||||0itac50      ba'}], 'ind2': u' '}}, {'101': {'ind1': u'|', 'subfields': [{u'a': u'ita'}, {u'c': u'ita'}], 'ind2': u' '}}, {'102': {'ind1': u' ', 'subfields': [{u'a': u'it'}], 'ind2': u' '}}, {'181': {'ind1': u' ', 'subfields': [{u'6': u'z01'}, {u'a': u'i '}, {u'b': u'xxxe  '}], 'ind2': u'1'}}, {'182': {'ind1': u' ', 'subfields': [{u'6': u'z01'}, {u'a': u'n'}], 'ind2': u'1'}}, {'200': {'ind1': u'1', 'subfields': [{u'a': u'Tutta mia la citt\xa9 '}, {u'f': u'Carlotta Pistone'}], 'ind2': u' '}}, {'210': {'ind1': u' ', 'subfields': [{u'a': u'Milano'}, {u'c': u'Piemme'}, {u'd': u'2014'}], 'ind2': u' '}}, {'215': {'ind1': u' ', 'subfields': [{u'a': u'306 p.'}, {u'd': u'22 cm'}], 'ind2': u' '}}, {'225': {'ind1': u'|', 'subfields': [{u'a': u'Piemme voci'}], 'ind2': u' '}}, {'300': {'ind1': u' ', 'subfields': [{u'a': u'In copertina: Milano in love'}], 'ind2': u' '}}, {'410': {'ind1': u' ', 'subfields': [{u'1': u'001CAG1804037'}, {u'1': u'2001 '}, {u'a': u'Piemme voci'}], 'ind2': u'0'}}, {'500': {'ind1': u'1', 'subfields': [{u'a': u'Tutta mia la citt\xa9 '}, {u'3': u'LO11530364'}, {u'9': u'RMLV077939'}], 'ind2': u'0'}}, {'517': {'ind1': u'1', 'subfields': [{u'a': u'Milano in love'}, {u'9': u'BVE0684571'}], 'ind2': u' '}}, {'676': {'ind1': u' ', 'subfields': [{u'a': u'853.92'}, {u'9': u'Narrativa italiana. 
2000-'}, {u'v': u'22'}], 'ind2': u' '}}, {'700': {'ind1': u' ', 'subfields': [{u'a': u'Pistone'}, {u'b': u', Carlotta'}, {u'3': u'RMLV077939'}, {u'4': u'070'}], 'ind2': u'1'}}, {'801': {'ind1': u' ', 'subfields': [{u'a': u'IT'}, {u'b': u'IT-000000'}, {u'c': u'20140730'}], 'ind2': u'3'}}, {'850': {'ind1': u' ', 'subfields': [{u'a': u'IT-'}, {u'a': u'IT-MI0185'}], 'ind2': u' '}}, {'950': {'ind1': u' ', 'subfields': [{u'a': u'Arch. della  Produzione Editoriale della Lombardia'}, {u'c': u'1 v.'}, {u'd': u' ELAPE-M     F18                     4725'}, {u'e': u' ELAPE0001649025  VMN                       1 v.'}, {u'f': u'B '}, {u'h': u'20141126'}, {u'i': u'20141126'}], 'ind2': u'0'}}, {'977': {'ind1': u' ', 'subfields': [{u'a': u' EL'}, {u'a': u' NB'}], 'ind2': u' '}}], 'leader': u'01030nam0 22003013i 450 '}

IE001_MIL_EL_00017104.zip

nemobis commented 6 years ago

Ah, the input is UNIMARC. Does this make the report invalid?

edsu commented 6 years ago

I didn't think UNIMARC was a problem. Is that the only error you see? Perhaps a codepoint was added to a MARC-8 character set that pymarc doesn't know about yet? It would sadden me a great deal to learn MARC-8 was still being actively developed.

nemobis commented 6 years ago

I don't think our data is so advanced! It might just be some control character entered by mistake, because this data has endured some funny travels between various platforms.

edsu commented 6 years ago

Actually it looks like pymarc doesn't have any character mappings for g0=66 or g1=69.

https://github.com/edsu/pymarc/blob/master/pymarc/marc8_mapping.py

Depending on what you are doing (a one off, or part of a workflow) you might want to consider converting your data to utf-8 with yaz-marcdump and then work with it from python?

nemobis commented 6 years ago

Ed Summers, 13/03/2018 23:24:

Depending on what you are doing (a one off, or part of a workflow) you might want to consider converting your data to utf-8 with yaz-marcdump https://software.indexdata.com/yaz/doc/yaz-marcdump.html and then work with it from python?

Thank you a lot for the suggestion. It's been a while since I last used yaz so I had neglected to consider it. I'll let you know how it goes (if it's relevant for this report; feel free to close as invalid!).

nemobis commented 6 years ago

After yaz-marcdump -i marc -f marc8 -t utf8 -o marc I get

  11710 couldn't find 0xaf in g0=66 g1=69
   7844 couldn't find 0x80 in g0=66 g1=69
   3205 couldn't find 0xbf in g0=66 g1=69
   1335 couldn't find 0xca in g0=66 g1=69
   1175 couldn't find 0xa0 in g0=66 g1=69
   1042 couldn't find 0xcc in g0=66 g1=69
    299 couldn't find 0xbb in g0=66 g1=69
    122 couldn't find 0xbe in g0=66 g1=69

edsu commented 6 years ago

That's weird. Why would it be processing MARC-8 if it had been converted to UTF-8?

nemobis commented 6 years ago

Smaller test case attached, from http://id.sbn.it/bid/BVE0764705

>>> from pymarc import MARCReader
>>> print(MARCReader(open('BVE0764705.marc21.mrc', 'rb')).next().get_fields('650')[0].subfields[3])
Attività professionale
>>> print(MARCReader(open('BVE0764705.unimarc.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xa0 in g0=66 g1=69
Attivit©  professionale

Note that the UNIMARC record has 0 in Leader/09, neither a space nor "a" (cf. https://www.loc.gov/marc/bibliographic/bdleader.html ).
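
For anyone who wants to check this flag without loading pymarc, here is a minimal sketch, assuming Python 3 and a raw ISO 2709 record in a bytes object. The function name leader_coding_scheme and the leaders used in the usage note are made up for illustration. The blank/"a" meanings are the MARC 21 ones from the page linked above; pymarc appears to fall back to MARC-8 decoding whenever the flag is not "a", which would fit the warnings seen here.

```python
def leader_coding_scheme(record_bytes: bytes) -> str:
    """Describe Leader/09 of a raw record, read with MARC 21 semantics."""
    leader = record_bytes[:24]  # the leader is always the first 24 bytes
    if len(leader) < 24:
        raise ValueError("record is shorter than a 24-byte leader")
    flag = chr(leader[9])  # character coding scheme position
    if flag == " ":
        return "MARC-8"
    if flag == "a":
        return "UCS/Unicode"
    # Any other value (such as "0") has no meaning in MARC 21.
    return "undefined in MARC 21: %r" % flag
```

With hypothetical leaders, leader_coding_scheme(b"00000nam a2200000 a 4500") reports "UCS/Unicode", while a leader with "0" in that position falls into the last branch.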

None of the yaz-marcdump conversion options which do something seem to help:

$ for code in iso5426 iso8859-1 marc8; do yaz-marcdump -i marc -o marc -t utf8 -f $code BVE0764705.unimarc.mrc > BVE0764705.unimarc.$code.mrc.new ; done
$ python 
>>> print(MARCReader(open('BVE0764705.unimarc.iso5426.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xcc in g0=66 g1=69
Attivit  professionale
>>> print(MARCReader(open('BVE0764705.unimarc.marc8.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
Attivit℗♭ professionale
>>> print(MARCReader(open('BVE0764705.unimarc.iso8859-1.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xa0 in g0=66 g1=69
Attivit©℗  professionale

The yaz-marcdump default conversion to UTF-8 appears correct in itself (cf. https://lists.uni-bielefeld.de/mailman2/unibi/public/librecat-dev/2017-January/000175.html):

$ yaz-marcdump -o marcxml -t utf8 BVE0764705.unimarc.mrc | grep 606 -A 4
  <datafield tag="606" ind1=" " ind2=" ">
    <subfield code="a">Operatori turistici</subfield>
    <subfield code="x">Attività professionale</subfield>
    <subfield code="2">FN </subfield>
    <subfield code="3">IT\ICCU\MILC\267308</subfield>
$ yaz-marcdump -t utf8 BVE0764705.unimarc.mrc | grep 606
606    $a Operatori turistici $x Attività professionale $2 FN  $3 IT\ICCU\MILC\267308
$ yaz-marcdump BVE0764705.unimarc.mrc | grep 606
606    $a Operatori turistici $x Attività professionale $2 FN  $3 IT\ICCU\MILC\267308

Sorry if I'm missing something obvious...

BVE0764705.marc21.mrc.gz BVE0764705.unimarc.mrc.gz

josephalway commented 5 years ago

The obscure warning is coming from lines 135-136 of the marc8.py file.

Generally, this section:

            try:
                if code_point > 0x80 and not mb_flag:
                    (uni, cflag) = marc8_mapping.CODESETS[self.g1][code_point]
                else:
                    (uni, cflag) = marc8_mapping.CODESETS[self.g0][code_point]
            except KeyError:
                try:
                    uni = marc8_mapping.ODD_MAP[code_point]
                    uni_list.append(unichr(uni))
                    # we can short circuit because we know these mappings
                    # won't be involved in combinings.  (i hope?)
                    continue
                except KeyError:
                    pass
                if not self.quiet:
                    sys.stderr.write("couldn't find 0x%x in g0=%s g1=%s\n" %
                        (code_point, self.g0, self.g1))

pymarc can't map the byte to a character, so it writes "couldn't find 0x%x in g0=%s g1=%s\n" to stderr, with %x replaced by the byte value and each %s by the code of the active G0 or G1 character set. That is really not helpful if you don't already know what the decoder is doing.

A simple change on line 135 would make the error much more human friendly: sys.stderr.write("Unable to read character, couldn't find 0x%x in g0=%s g1=%s\n" %

In my case, I was able to correct the single character that was giving me the problem: a math symbol that wasn't being read correctly or had been corrupted.
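
To illustrate the lookup and the friendlier message, here is a toy sketch, assuming Python 3. It is not pymarc's code: TOY_CODESETS and lookup are hypothetical names, and the tables hold one entry each as stand-ins for the real marc8_mapping tables. It only mirrors the branch quoted above (bytes above 0x80 go to the G1 table, everything else to G0).

```python
import sys

# Hypothetical, heavily truncated stand-ins for pymarc's mapping tables.
TOY_CODESETS = {
    0x42: {0x41: 0x0041},  # G0 set 0x42: "A"
    0x45: {0xC3: 0x00A9},  # G1 set 0x45: copyright sign
}

def lookup(code_point, g0=0x42, g1=0x45, out=sys.stderr):
    """Return the mapped character, or None after writing a warning to out."""
    table = TOY_CODESETS[g1] if code_point > 0x80 else TOY_CODESETS[g0]
    try:
        return chr(table[code_point])
    except KeyError:
        # Same fields as the original warning, with a plain-language prefix.
        out.write("Unable to read character, couldn't find 0x%x in g0=%s g1=%s\n"
                  % (code_point, g0, g1))
        return None
```

Calling lookup(0xa0) writes the prefixed warning with g0=66 g1=69, because %s formats the integers 0x42 and 0x45 in decimal.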

josephalway commented 5 years ago

import pymarc as pym

with open('C:\\Users\\MY_USER\\Downloads\\IE001_MIL_EL_00017104\\IE001_MIL_EL_00017104.mrc', 'rb') as fh:
    # force_utf8=True makes pymarc decode as UTF-8 even if the leader says otherwise.
    reader = pym.MARCReader(fh, to_unicode=True, force_utf8=True)
    for record in reader:
        for field in record.get_fields('020'):
            if field['a'] is not None:
                print(field['a'])
            else:
                print('No ISBN')

I tested your data and found that setting "to_unicode=True, force_utf8=True" when reading the file removes all of the "couldn't find" errors.

From the MARCReader class docstring:

If you find yourself in the unfortunate position of having data that
is utf-8 encoded without the leader set appropriately you can use
the force_utf8 parameter:

    reader = MARCReader(file('file.dat'), to_unicode=True,
        force_utf8=True)

tfmorris commented 4 years ago

Actually it looks like pymarc doesn't have any character mappings for g0=66 or g1=69.

I think 66 (0x42) and 69 (0x45) are actually the default character sets:

42(hex) [ASCII graphic: B] = Basic Latin (ASCII)
21(hex)45(hex) [ASCII graphics: !E] = Extended Latin (ANSEL)
(the 21(hex) technically is a second character of the Intermediate segment of this escape sequence.)

per: https://www.loc.gov/marc/specifications/speccharmarc8.html#field066

Based on the comment above (https://github.com/edsu/pymarc/issues/114#issuecomment-446785726), it sounds like the MARC file contains UTF-8 encoded characters.
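
A small sketch of why that diagnosis fits the warning, assuming Python 3 (the sessions earlier in the thread are Python 2). UTF-8 encodes "à" as the two bytes 0xC3 0xA0. As far as I read the MARC-8 tables, 0xC3 is the copyright sign in Extended Latin (ANSEL) and 0xA0 has no mapping, which would give exactly "Attivit©" plus a warning about 0xa0. The helper looks_like_utf8 is a hypothetical name, not part of pymarc.

```python
# "à" (U+00E0) becomes two bytes in UTF-8; the second is the 0xa0 from the warning.
print([hex(b) for b in "à".encode("utf-8")])

def looks_like_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as UTF-8 and contains non-ASCII bytes."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Pure ASCII decodes as UTF-8 too, so require at least one high byte.
    return any(b >= 0x80 for b in raw)
```

Running this heuristic over the raw field data before choosing reader options would be one way to decide whether force_utf8=True is appropriate; it is only a heuristic, since some byte sequences are valid in more than one encoding.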