Open landreev opened 1 month ago
Hmm, I actually don't understand what's going on - looking at the existing RecordParser tests, I don't really get how they are passing.
Ok, I see, the tests are passing because of
context = new Context().withMetadataTransformer("oai_dc", KnownTransformer.OAI_DC);
in the test setup.
@poikilotherm I want to close this issue, since I opened it based on not understanding how that parser was supposed to work. (I warned upfront that that was a possibility.)
But can I keep it open for just a little longer, just to understand what's going on there?
Am I reading it correctly that XOAI can only harvest metadata for which it has a to_xoai xsl transform?
(as you can see, Dataverse hasn't been using this parser at all)
I may ask for, and/or make a PR adding an extra feature to record processing.
It appears that harvesting via ListRecords is broken. The reason we never noticed is that the Dataverse OAI client hasn't been using it, relying instead on making a ListIdentifiers call, then calling GetRecord for each non-deleted identifier. I am, however, working on adding optional support for harvesting via ListRecords as well.
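For readers less familiar with the two harvesting strategies, here is a minimal sketch of the OAI-PMH requests each one issues. The verbs and the `metadataPrefix` / `identifier` parameters come from the OAI-PMH protocol; the class name, method names, base URL and identifiers are hypothetical, and resumption tokens and deleted-record filtering are left out for brevity. This is not how the Dataverse client or XOAI builds its requests.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: builds request URLs, performs no network I/O.
public class HarvestRequestSketch {

    // URL-encodes a query parameter value.
    static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Strategy 1: one ListIdentifiers call, then one GetRecord call per identifier.
    public static List<String> viaIdentifiers(String baseUrl, String prefix, List<String> ids) {
        List<String> urls = new ArrayList<>();
        urls.add(baseUrl + "?verb=ListIdentifiers&metadataPrefix=" + enc(prefix));
        for (String id : ids) {
            urls.add(baseUrl + "?verb=GetRecord&metadataPrefix=" + enc(prefix)
                    + "&identifier=" + enc(id));
        }
        return urls;
    }

    // Strategy 2: a single ListRecords call that returns the records themselves.
    public static List<String> viaListRecords(String baseUrl, String prefix) {
        return List.of(baseUrl + "?verb=ListRecords&metadataPrefix=" + enc(prefix));
    }

    public static void main(String[] args) {
        String base = "https://example.org/oai"; // assumed endpoint
        List<String> ids = List.of("oai:example.org:1", "oai:example.org:2");
        viaIdentifiers(base, "oai_dc", ids).forEach(System.out::println);
        viaListRecords(base, "oai_dc").forEach(System.out::println);
    }
}
```

The second strategy trades many small requests for one response that embeds every record's metadata, which is why it exercises the record and metadata parsers in a way the first strategy does not.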
To skip directly to the punchline, I believe all it is is this line:
https://github.com/gdcc/xoai/blob/75840059cde4a4398d3755aa41a624bbd2c03412/xoai-service-provider/src/main/java/io/gdcc/xoai/serviceprovider/parsers/MetadataParser.java#L34
The problem being that the <metadata> tag in question has already been parsed by the RecordParser before this parser has been called, here: https://github.com/gdcc/xoai/blob/75840059cde4a4398d3755aa41a624bbd2c03412/xoai-service-provider/src/main/java/io/gdcc/xoai/serviceprovider/parsers/RecordParser.java#L48
A larger fragment:
https://github.com/gdcc/xoai/blob/75840059cde4a4398d3755aa41a624bbd2c03412/xoai-service-provider/src/main/java/io/gdcc/xoai/serviceprovider/parsers/RecordParser.java#L48-L60
In other words, when it's trying to parse this fragment of a ListRecords response:
the content String in line 49 above will only contain the <oai_dc:dc ...> ... </oai_dc:dc> part, and that's where the next parser bombs.
The fix appears to be as simple as commenting out line 34 in MetadataParser.java 😄. But it would be prudent to add a test or two that would attempt to parse some example fragments.
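To make the failure mode concrete, here is a minimal, self-contained sketch using only the JDK's StAX API. It does not use the real XOAI classes: `strictParse` is a hypothetical stand-in for a sub-parser that insists on seeing a <metadata> start tag, `firstElement` stands in for one that accepts the payload element directly, and the sample XML is made up. It only illustrates the situation described above, where the wrapper tag has already been consumed by the time the sub-parser receives its input.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration; none of these names exist in XOAI.
public class MetadataCursorSketch {

    // What the sub-parser receives once the outer parser has consumed <metadata>.
    public static final String CONTENT_ONLY =
            "<oai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\""
            + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
            + "<dc:title>Example</dc:title></oai_dc:dc>";

    // The same payload with the wrapper tag still in place.
    public static final String WRAPPED = "<metadata>" + CONTENT_ONLY + "</metadata>";

    // Returns the local name of the first start element in the given XML text.
    public static String firstElement(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getLocalName();
                }
            }
            throw new IllegalStateException("no element found");
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Strict variant: fails unless the input begins with <metadata>.
    public static void strictParse(String xml) {
        String first = firstElement(xml);
        if (!"metadata".equals(first)) {
            throw new IllegalStateException("expected <metadata>, found <" + first + ">");
        }
    }

    // Convenience wrapper so callers can check acceptance without try/catch.
    public static boolean strictAccepts(String xml) {
        try {
            strictParse(xml);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("strict accepts wrapped: " + strictAccepts(WRAPPED));
        System.out.println("strict accepts content only: " + strictAccepts(CONTENT_ONLY));
        System.out.println("first element of content only: " + firstElement(CONTENT_ONLY));
    }
}
```

A regression test along the same lines, but driving the real RecordParser with an example ListRecords fragment, would catch this kind of mismatch between what the outer parser hands over and what the inner parser expects.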