gdcc / xoai

OAI-PMH Java Toolkit
BSD 3-Clause "New" or "Revised" License

Problem with the client-side ("serviceprovider") implementation of ListRecords #278

Open landreev opened 1 month ago

landreev commented 1 month ago

It appears that harvesting via ListRecords is broken. The reason we never noticed is that the Dataverse OAI client hasn't been using it; it relies instead on making a ListIdentifiers call and then calling GetRecord for each non-deleted identifier. I am, however, working on adding optional support for harvesting via ListRecords as well.
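For readers unfamiliar with that two-step approach, here is a minimal sketch of it. The `OaiClient` interface and `Header` class are hypothetical stand-ins invented for illustration; they are not the XOAI `ServiceProvider` API or the actual Dataverse code.

```java
import java.util.ArrayList;
import java.util.List;

class TwoStepHarvestSketch {

    /** Hypothetical record header, as a ListIdentifiers call might report it. */
    static final class Header {
        final String identifier;
        final boolean deleted;

        Header(String identifier, boolean deleted) {
            this.identifier = identifier;
            this.deleted = deleted;
        }
    }

    /** Hypothetical minimal client; NOT the XOAI ServiceProvider interface. */
    interface OaiClient {
        List<Header> listIdentifiers(String metadataPrefix);

        String getRecord(String identifier, String metadataPrefix);
    }

    /** Lists all identifiers, then fetches each non-deleted record individually. */
    static List<String> harvest(OaiClient client, String metadataPrefix) {
        List<String> records = new ArrayList<>();
        for (Header header : client.listIdentifiers(metadataPrefix)) {
            if (header.deleted) {
                continue; // deleted records have no metadata to fetch
            }
            records.add(client.getRecord(header.identifier, metadataPrefix));
        }
        return records;
    }
}
```

This costs one GetRecord request per record, which is why a single paged ListRecords call is attractive as an option.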

To skip directly to the punchline, I believe it all comes down to this line:

https://github.com/gdcc/xoai/blob/75840059cde4a4398d3755aa41a624bbd2c03412/xoai-service-provider/src/main/java/io/gdcc/xoai/serviceprovider/parsers/MetadataParser.java#L34

The problem is that the <metadata> tag in question has already been parsed by the RecordParser before this parser is called, here:

https://github.com/gdcc/xoai/blob/75840059cde4a4398d3755aa41a624bbd2c03412/xoai-service-provider/src/main/java/io/gdcc/xoai/serviceprovider/parsers/RecordParser.java#L48

A larger fragment:

https://github.com/gdcc/xoai/blob/75840059cde4a4398d3755aa41a624bbd2c03412/xoai-service-provider/src/main/java/io/gdcc/xoai/serviceprovider/parsers/RecordParser.java#L48-L60

In other words, when it's trying to parse this fragment of a ListRecords response:

<metadata>
<oai_dc:dc xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>This is my test dataset</dc:title>
<dc:identifier>https://doi.org/10.5072/FK2/U6AEZM</dc:identifier>
<dc:creator>Castro, Eleni</dc:creator>
<dc:publisher>Demo Dataverse</dc:publisher>
<dc:description>This is my dataset</dc:description>
<dc:subject>Social Sciences</dc:subject>
<dc:subject>Test</dc:subject>
<dc:date>2015-04-20</dc:date>
<dc:contributor>Admin, Dataverse</dc:contributor>
</oai_dc:dc>
</metadata>

the content String in line 49 above will only contain the <oai_dc:dc ...> ... </oai_dc:dc> part, and that's where the next parser bombs with

java.util.NoSuchElementException null
StackTrace: 
com.ctc.wstx.evt.WstxEventReader.throwEndOfInput(WstxEventReader.java:511)
com.ctc.wstx.evt.WstxEventReader.nextEvent(WstxEventReader.java:270)
io.gdcc.xoai.xmlio.XmlReader.next(XmlReader.java:129)
io.gdcc.xoai.serviceprovider.parsers.MetadataParser.parse(MetadataParser.java:34)
io.gdcc.xoai.serviceprovider.parsers.RecordParser.parse(RecordParser.java:60)
io.gdcc.xoai.serviceprovider.parsers.ListRecordsParser.next(ListRecordsParser.java:58)
io.gdcc.xoai.serviceprovider.handler.ListRecordHandler.nextIteration(ListRecordHandler.java:67)
io.gdcc.xoai.serviceprovider.lazy.ItemIterator.hasNext(ItemIterator.java:31)
io.gdcc.xoai.serviceprovider.lazy.ItemIterator.<init>(ItemIterator.java:22)
io.gdcc.xoai.serviceprovider.ServiceProvider.listRecords(ServiceProvider.java:73)
edu.harvard.iq.dataverse.harvest.client.oai.OaiHandler.runListRecords(OaiHandler.java:266)
edu.harvard.iq.dataverse.harvest.client.HarvesterServiceBean.harvestOAIviaListRecords(HarvesterServiceBean.java:289)

The fix appears to be as simple as commenting out line 34 in MetadataParser.java 😄. But it would seem prudent to add a test or two that attempt to parse some example fragments.
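To illustrate the failure mode in the stack trace, here is a small standalone sketch using the JDK's built-in StAX API rather than XOAI's own XmlReader or parsers. The class and method names, and the simplified un-prefixed sample XML with its example namespace, are made up for illustration. It only shows the generic mechanism: scanning forward for a start tag that is not in the input runs off the end of the event stream, and `XMLEventReader.nextEvent()` then throws `NoSuchElementException`.

```java
import java.io.StringReader;
import java.util.NoSuchElementException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class MetadataFragmentDemo {

    /** Advances the reader until it has consumed a start tag with the given local name. */
    static void skipPastStartElement(XMLEventReader reader, String localName)
            throws XMLStreamException {
        while (true) {
            // nextEvent() throws NoSuchElementException once the input is exhausted
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(localName)) {
                return;
            }
        }
    }

    /** Returns true if searching for the start tag runs off the end of the input. */
    static boolean runsOffEnd(String xml, String localName) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            try {
                skipPastStartElement(reader, localName);
                return false;
            } catch (NoSuchElementException e) {
                return true;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Simplified stand-in for the record content; the namespace is hypothetical.
        String inner = "<dc xmlns=\"http://example.org/hypothetical\">"
                + "<title>This is my test dataset</title></dc>";
        String wrapped = "<metadata>" + inner + "</metadata>";

        System.out.println("wrapped runs off end: " + runsOffEnd(wrapped, "metadata"));
        System.out.println("inner runs off end:   " + runsOffEnd(inner, "metadata"));
    }
}
```

A regression test along these lines would feed the parser both the wrapped and the unwrapped form and assert which one is expected to succeed.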

landreev commented 1 month ago

Hmm, I actually don't understand what's going on - looking at the existing RecordParser tests, I don't really get how they are passing.

landreev commented 1 month ago

Ok, I see, the tests are passing because of context = new Context().withMetadataTransformer("oai_dc", KnownTransformer.OAI_DC); in the test setup.

landreev commented 1 month ago

@poikilotherm I want to close this issue, since I opened it based on not understanding how that parser was supposed to work. (I warned upfront that that was a possibility.) But can I keep it open for just a little longer, just to understand what's going on there? Am I reading it correctly that XOAI can only harvest metadata for which it has a to_xoai XSL transform?

(as you can see, Dataverse hasn't been using this parser at all)

landreev commented 1 month ago

I may ask for, and/or make a PR adding, an extra feature to record processing.