nemobis opened this issue 6 years ago
Ah, the input is UNIMARC. Does this make the report invalid?
I didn't think UNIMARC was a problem. Is that the only error you see? Perhaps a codepoint was added to a MARC-8 character set that pymarc doesn't know about yet? It would sadden me a great deal to learn MARC-8 was still being actively developed.
I don't think our data is so advanced! It might just be some control character entered by mistake, because this data has endured some funny travels between various platforms.
Actually it looks like pymarc doesn't have any character mappings for g0=66 or g1=69.
https://github.com/edsu/pymarc/blob/master/pymarc/marc8_mapping.py
Depending on what you are doing (a one off, or part of a workflow) you might want to consider converting your data to utf-8 with yaz-marcdump and then work with it from python?
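If it becomes part of a workflow, the yaz-marcdump step can be driven from Python via subprocess. A sketch, assuming yaz-marcdump is on PATH (the function names are mine; the flags match the invocations used later in this thread):

```python
import subprocess

def yaz_convert_cmd(src, from_enc='marc8', to_enc='utf8'):
    # -i/-o pick the record syntax (ISO 2709 "marc");
    # -f/-t pick the input and output character encodings.
    return ['yaz-marcdump', '-i', 'marc', '-o', 'marc',
            '-f', from_enc, '-t', to_enc, src]

def convert_to_utf8(src, dst):
    # yaz-marcdump writes the converted records to stdout,
    # so we redirect stdout into the destination file.
    with open(dst, 'wb') as out:
        subprocess.run(yaz_convert_cmd(src), stdout=out, check=True)
```

The converted file can then be opened with pymarc as usual.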
Ed Summers, 13/03/2018 23:24:
Depending on what you are doing (a one off, or part of a workflow) you might want to consider converting your data to utf-8 with yaz-marcdump https://software.indexdata.com/yaz/doc/yaz-marcdump.html and then work with it from python?
Thanks a lot for the suggestion. It's been a while since I last used yaz, so I had neglected to consider it. I'll let you know how it goes (if it turns out not to be relevant for this report, feel free to close as invalid!).
After yaz-marcdump -i marc -f marc8 -t utf8 -o marc
I get
11710 couldn't find 0xaf in g0=66 g1=69
7844 couldn't find 0x80 in g0=66 g1=69
3205 couldn't find 0xbf in g0=66 g1=69
1335 couldn't find 0xca in g0=66 g1=69
1175 couldn't find 0xa0 in g0=66 g1=69
1042 couldn't find 0xcc in g0=66 g1=69
299 couldn't find 0xbb in g0=66 g1=69
122 couldn't find 0xbe in g0=66 g1=69
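As an aside, counts in this shape can be tallied from yaz-marcdump's stderr with `sort | uniq -c`, or in Python (the sample lines below are made up, not the full stderr from this run):

```python
from collections import Counter

# Pretend these lines were captured from yaz-marcdump's stderr.
lines = [
    "couldn't find 0xaf in g0=66 g1=69",
    "couldn't find 0xaf in g0=66 g1=69",
    "couldn't find 0x80 in g0=66 g1=69",
]
tally = Counter(lines)
for msg, n in tally.most_common():
    print(n, msg)  # same "count message" shape as the list above
```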
That's weird. Why would it be processing MARC-8 if it had been converted to UTF-8?
Smaller test case attached, from http://id.sbn.it/bid/BVE0764705
>>> from pymarc import MARCReader
>>> print(MARCReader(open('BVE0764705.marc21.mrc', 'rb')).next().get_fields('650')[0].subfields[3])
Attività professionale
>>> print(MARCReader(open('BVE0764705.unimarc.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xa0 in g0=66 g1=69
Attivit© professionale
Note the UNIMARC record has "0" in Leader/09, not a space nor an "a" (cf. https://www.loc.gov/marc/bibliographic/bdleader.html ).
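Leader/09 can be checked directly from the raw file; a sketch (the function name is mine):

```python
# The leader is the first 24 bytes of an ISO 2709 record; position 09 is the
# character coding scheme in MARC 21 (' ' = MARC-8, 'a' = UCS/Unicode).
def leader_09(path):
    with open(path, 'rb') as fh:
        return chr(fh.read(24)[9])
```

For the UNIMARC record above, this returns '0', which is neither of the MARC 21 values.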
None of the yaz-marcdump conversion options that have any effect seem to help:
$ for code in iso5426 iso8859-1 marc8; do yaz-marcdump -i marc -o marc -t utf8 -f $code BVE0764705.unimarc.mrc > BVE0764705.unimarc.$code.mrc.new ; done
$ python
>>> print(MARCReader(open('BVE0764705.unimarc.iso5426.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xcc in g0=66 g1=69
Attivit professionale
>>> print(MARCReader(open('BVE0764705.unimarc.marc8.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
Attivit℗♭ professionale
>>> print(MARCReader(open('BVE0764705.unimarc.iso8859-1.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xa0 in g0=66 g1=69
Attivit©℗ professionale
The yaz-marcdump default conversion to UTF-8 appears correct in itself (cf. https://lists.uni-bielefeld.de/mailman2/unibi/public/librecat-dev/2017-January/000175.html):
$ yaz-marcdump -o marcxml -t utf8 BVE0764705.unimarc.mrc | grep 606 -A 4
<datafield tag="606" ind1=" " ind2=" ">
<subfield code="a">Operatori turistici</subfield>
<subfield code="x">Attività professionale</subfield>
<subfield code="2">FN </subfield>
<subfield code="3">IT\ICCU\MILC\267308</subfield>
$ yaz-marcdump -t utf8 BVE0764705.unimarc.mrc | grep 606
606 $a Operatori turistici $x Attività professionale $2 FN $3 IT\ICCU\MILC\267308
$ yaz-marcdump BVE0764705.unimarc.mrc | grep 606
606 $a Operatori turistici $x Attività professionale $2 FN $3 IT\ICCU\MILC\267308
Sorry if I'm missing something obvious...
The obscure warning is coming from lines 135-136 of the marc8.py file.
Generally, this section:
try:
    if code_point > 0x80 and not mb_flag:
        (uni, cflag) = marc8_mapping.CODESETS[self.g1][code_point]
    else:
        (uni, cflag) = marc8_mapping.CODESETS[self.g0][code_point]
except KeyError:
    try:
        uni = marc8_mapping.ODD_MAP[code_point]
        uni_list.append(unichr(uni))
        # we can short circuit because we know these mappings
        # won't be involved in combinings. (i hope?)
        continue
    except KeyError:
        pass
    if not self.quiet:
        sys.stderr.write("couldn't find 0x%x in g0=%s g1=%s\n" %
                         (code_point, self.g0, self.g1))
It's unable to map the character, so it emits that bit of information: "couldn't find 0x%x in g0=%s g1=%s\n", with %x and %s replaced by the relevant values. That's really not helpful if you don't already know what the code is doing.
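The lookup-and-fallback pattern can be illustrated with a self-contained sketch. The tables here are tiny stand-ins, not pymarc's real mappings, and I return U+FFFD for illustration where pymarc instead prints the warning:

```python
# Stand-in tables, keyed by the g0/g1 set designations (not pymarc's data).
CODESETS = {
    0x42: {0x41: 0x0041},  # pretend "Basic Latin" table
    0x45: {0xE0: 0x0300},  # pretend "Extended Latin" table
}
ODD_MAP = {0xC7: 0x00DF}   # pretend odd-character fallback

def decode_byte(code_point, g0=0x42, g1=0x45):
    # Bytes above 0x80 are looked up in the g1 set, others in g0.
    table = CODESETS[g1] if code_point > 0x80 else CODESETS[g0]
    try:
        return chr(table[code_point])
    except KeyError:
        try:
            return chr(ODD_MAP[code_point])
        except KeyError:
            # This is the case where pymarc prints
            # "couldn't find 0x.. in g0=.. g1=.."
            return '\ufffd'
```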
A simple change on line 135 would make the error much more human-friendly:
sys.stderr.write("Unable to read character, couldn't find 0x%x in g0=%s g1=%s\n" %
In my case, I was able to correct the single character that was causing the problem: a math symbol that wasn't being read correctly or had been corrupted.
import pymarc as pym

with open('C:\\Users\\MY_USER\\Downloads\\IE001_MIL_EL_00017104\\IE001_MIL_EL_00017104.mrc', 'rb') as fh:
    reader = pym.MARCReader(fh, to_unicode=True, force_utf8=True)
    for record in reader:
        for field in record.get_fields('020'):
            if field['a'] is not None:
                print(field['a'])
            else:
                print('No ISBN')
I tested your data and found that setting to_unicode=True, force_utf8=True when reading the file removes all of the "couldn't find" errors.
From the MARCReader class docstring:
If you find yourself in the unfortunate position of having data that is utf-8 encoded without the leader set appropriately you can use the force_utf8 parameter: reader = MARCReader(file('file.dat'), to_unicode=True, force_utf8=True)
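Conceptually (this is an illustration, not pymarc internals), force_utf8 means the raw bytes are decoded as UTF-8 regardless of what Leader/09 claims. Decoding the same bytes with a wrong single-byte codec shows the kind of mojibake seen above:

```python
raw = 'Attività professionale'.encode('utf-8')
decoded = raw.decode('utf-8')    # the force_utf8 path: correct text
garbled = raw.decode('latin-1')  # wrong-codec path: mojibake
print(decoded)
print(garbled)
```

Note that 'à' is 0xC3 0xA0 in UTF-8, which is at least consistent with the 0xa0 bytes the MARC-8 decoder was complaining about earlier in this thread.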
Actually it looks like pymarc doesn't have any character mappings for g0=66 or g1=69.
I think 66 (0x42) and 69 (0x45) are actually the default character sets:
42(hex) [ASCII graphic: B] = Basic Latin (ASCII)
21(hex)45(hex) [ASCII graphics: !E] = Extended Latin (ANSEL)
(the 21(hex) technically is the second character of the Intermediate segment of this escape sequence.)
per: https://www.loc.gov/marc/specifications/speccharmarc8.html#field066
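In other words, the g0=66 / g1=69 values in the warning are just those set designations printed in decimal:

```python
# Decimal 66 and 69 are the hex designations from the spec quoted above.
print(hex(66), chr(66))  # 0x42 'B' -> Basic Latin (ASCII)
print(hex(69), chr(69))  # 0x45 'E' -> Extended Latin (ANSEL)
```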
Based on the comment above: https://github.com/edsu/pymarc/issues/114#issuecomment-446785726 it sounds like the MARC file contains UTF-8 encoded characters.
Not all, but a good portion of the records in the associated mrc file, when read, produce the warning "couldn't find 0xa0 in g0=66 g1=69". Is this expected?
IE001_MIL_EL_00017104.zip