LibreCat / Catmandu-OAI

Catmandu modules for working with OAI repositories
https://metacpan.org/release/Catmandu-OAI

Segmentation fault during harvesting using the built-in mods handler in the importer #22

Closed (snorri closed this issue 7 years ago)

snorri commented 7 years ago

catmandu convert OAI --url http://lup.lub.lu.se/oai --metadataPrefix mods > mods.txt

dies with a segmentation fault when run on Debian GNU/Linux 8.7 (jessie) with Perl v5.20.2. It happens after less than one minute of harvesting.

You can see the last record in the output here: http://lup.lub.lu.se/oai?verb=GetRecord&metadataPrefix=mods&identifier=oai:lup.lub.lu.se:ec985f41-4fe0-4b39-b970-72b6741ac60c

nicolasfranck commented 7 years ago

Something in your metadata triggered this bug, as I do not see this problem elsewhere. I set up Catmandu logging and saw this problem being reported:

identifier not allowed in MODS::Element::Name at /Users/njfranck/.plenv/versions/5.22.0/lib/perl5/site_perl/5.22.0/MODS/Record.pm line 2423

It was reported in file lib/Catmandu/Importer/OAI/Parser/mods.pm

The code always seems to break in the package above. I'm not sure why.

phochste commented 7 years ago

I see that the record example contains illegal HTML markup in the abstract:

<abstract lang="eng" ><p>Three founder mutations ...

and an XML tag that is not part of the MODS standard:

<recordDateApproved encoding="w3cdtf" >2016-09-18T12:08:24+2:00</recordDateApproved>

What could have happened is that HTML markup fragments (e.g. unclosed tags) threw the XML parsing into disarray.

snorri commented 7 years ago

The HTML tags are not part of the XML markup (the content of abstract is escaped), so they should not affect the XML parsing. In the raw XML text it looks like this:

`<abstract lang="eng" >&lt;p&gt;Three founder mutations ...`

I am aware of the issues with the identifier and recordDateApproved, but it should still be possible to parse the elements.
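
To illustrate the escaping point, here is a minimal sketch using Python's standard-library XML parser rather than the Perl stack Catmandu uses. The two fragments are shortened, hypothetical stand-ins for the record's abstract (the closing tags are added so the fragments are complete), and `describe` is a helper name invented for this example. It only shows how a conforming XML parser treats escaped versus unescaped markup; it says nothing about how MODS::Record behaves.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-ins for the abstract in the harvested record.
# In the first, the HTML is escaped, as in the raw OAI response.
escaped = '<abstract lang="eng">&lt;p&gt;Three founder mutations&lt;/p&gt;</abstract>'
# In the second, an unescaped <p> is left unclosed, the case suspected earlier.
unescaped_unclosed = '<abstract lang="eng"><p>Three founder mutations</abstract>'


def describe(xml_text):
    """Return (text, child_count) if the fragment parses, or None if it is rejected."""
    try:
        element = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    return element.text, len(element)


# Escaped markup becomes plain character data with no child elements.
print(describe(escaped))
# An unescaped, unclosed tag makes the fragment ill-formed, so parsing fails.
print(describe(unescaped_unclosed))
```

With escaped content the parser sees a single text node and no child elements, so the HTML inside the abstract cannot change the element structure. Only unescaped, unbalanced tags would make the document ill-formed.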

phochste commented 7 years ago

Version 0.14, which solves this issue, is on its way to CPAN. There was a memory leak in passing the XML to the MODS::Record processor.