Closed snorri closed 7 years ago
Something in your metadata triggered this bug, as I do not see this problem elsewhere. I set up Catmandu logging and saw this error being reported:
identifier not allowed in MODS::Element::Name at /Users/njfranck/.plenv/versions/5.22.0/lib/perl5/site_perl/5.22.0/MODS/Record.pm line 2423
It was reported in file lib/Catmandu/Importer/OAI/Parser/mods.pm
The code always seems to break in the package above. I'm not sure why.
I see that the record example contains illegal HTML markup in the abstract:
<abstract lang="eng" ><p>Three founder mutations ...
and an XML tag that is not part of the MODS standard:
<recordDateApproved encoding="w3cdtf" >2016-09-18T12:08:24+2:00</recordDateApproved>
What could have happened is that HTML markup fragments (e.g. unclosed tags) threw the XML parsing into disarray.
The HTML tags are not part of the XML markup (the content of abstract is escaped), so they should not affect the XML parsing. In the raw XML text it looks like this:
`<abstract lang="eng" >&lt;p&gt;Three founder mutations ...`
I am aware of the issues with the identifier and recordDateApproved, but it should still be possible to parse the elements.
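To illustrate the escaping point, here is a minimal Python sketch. It does not use Catmandu or MODS::Record; it only shows how a generic XML parser treats the two variants. The fragment is a shortened, hypothetical stand-in for the record above, with a closing tag added so it can be parsed on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments modelled on the abstract in this thread.
# "escaped" mirrors the raw OAI response; "unescaped" is what a
# browser-rendered copy of the same text looks like.
escaped = '<abstract lang="eng" >&lt;p&gt;Three founder mutations ...</abstract>'
unescaped = '<abstract lang="eng" ><p>Three founder mutations ...</abstract>'

# Escaped HTML is plain character data: no child elements are created.
abstract = ET.fromstring(escaped)
text = abstract.text
child_count = len(abstract)

# A literal, unclosed <p> makes the fragment ill-formed XML.
try:
    ET.fromstring(unescaped)
    unescaped_parses = True
except ET.ParseError:
    unescaped_parses = False

print(repr(text), child_count, unescaped_parses)
```

With the escaped form the parser sees a single text node, so the HTML inside the abstract should not be what breaks the harvest.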
Version 0.14, which solves this issue, is on its way to CPAN. There was a memory leak in passing the XML to the MODS::Record processor.
catmandu convert OAI --url http://lup.lub.lu.se/oai --metadataPrefix mods > mods.txt
dies with a segmentation fault when running on Debian GNU/Linux 8.7 (jessie), Perl v5.20.2. It happens after less than one minute of harvesting.
You can see the last record in the output here: http://lup.lub.lu.se/oai?verb=GetRecord&metadataPrefix=mods&identifier=oai:lup.lub.lu.se:ec985f41-4fe0-4b39-b970-72b6741ac60c