gdcc / xoai

OAI-PMH Java Toolkit
BSD 3-Clause "New" or "Revised" License

Service Provider: some OAI-PMH responses are not parseable by XOAI due to XML binding conflicts #141

Closed eduardorep closed 2 weeks ago

eduardorep commented 1 year ago

Hello, I'd like to know whether you have run into this issue, https://github.com/DSpace/xoai/issues/67, and if so, whether you resolved it or have any knowledge that might help us resolve it.

pdurbin commented 1 year ago

@eduardorep hi! Do you have a URL we can harvest from to test this?

eduardorep commented 1 year ago

Sure thing, here you go mate: https://doaj.org/oai?verb=ListRecords&metadataPrefix=oai_dc&setSpec=TENDOkRlcm1hdG9sb2d5

poikilotherm commented 1 year ago

Hi @eduardorep just to make sure I got this right: we are talking about the service provider here, using it to harvest a resource, aye?

I tried to read up on the other issues and it seems like you want this to parse just fine, right? Have you tried using our new service provider yet, as it already has an updated version of Woodstox, which might change things already?

eduardorep commented 1 year ago

We haven't tried it yet because we first wanted to understand whether it would solve our issue. Since switching to this lib would bring breaking changes, I wanted to know whether this issue had been tackled explicitly. But your suggestion sounds like a viable option. Thank you very much, we will likely try it :) Have a nice one!

poikilotherm commented 1 year ago

Please feel free to come back anytime! This is probably something that would affect Dataverse Installations round the world. Fixing this would definitely be in scope!

jfeio commented 1 year ago

Hey, so we upgraded our XOAI to use this fork to test whether the issue described in https://github.com/DSpace/xoai/issues/67 still happens, and I'm afraid it does.

For the records listed in the following response:

https://doaj.org/oai?verb=ListRecords&metadataPrefix=oai_dc&setSpec=TENDOkRlcm1hdG9sb2d5

Processing returns the error: The prefix "xsi" for attribute "xsi:schemaLocation" associated with an element type "oai_dc:dc" is not bound.

This issue seems to be caused by the fact that the namespace "xmlns:xsi" is only defined in the root OAI-PMH element, and not in each oai_dc:dc element.
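To make the failure mode concrete, here is a minimal sketch that triggers the same kind of parser error with the JDK's built-in parser. The record below is a hypothetical, trimmed-down example (not copied from the DOAJ response), and the assumption that the metadata element ends up being parsed on its own, outside the root element that declares the prefix, is mine rather than something confirmed about XOAI's internals.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class UnboundPrefixDemo {
    // Hypothetical record: uses xsi:schemaLocation but does not declare xmlns:xsi itself.
    static final String RECORD =
        "<oai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" "
        + "xsi:schemaLocation=\"http://www.openarchives.org/OAI/2.0/oai_dc/ "
        + "http://www.openarchives.org/OAI/2.0/oai_dc.xsd\"/>";

    // The same record inside a response whose root element declares xmlns:xsi.
    static final String RESPONSE =
        "<OAI-PMH xmlns=\"http://www.openarchives.org/OAI/2.0/\" "
        + "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">"
        + "<metadata>" + RECORD + "</metadata></OAI-PMH>";

    /** Returns true if a namespace-aware parser accepts the input. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Keep stderr quiet; fatal errors are still thrown as exceptions.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Inside the full response the prefix is inherited from the root element.
        System.out.println("whole response parses: " + parses(RESPONSE));
        // Taken out of that context, the xsi prefix is no longer bound.
        System.out.println("record on its own parses: " + parses(RECORD));
    }
}
```

The whole response is fine for a namespace-aware parser because the declaration on the root element is in scope for all descendants; the prefix only becomes unbound once the record is handled in isolation.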

While this issue is ultimately caused by non-compliance with the OAI-PMH specification on DOAJ's part, it would be great if the XOAI parser could be configured either to ignore namespace errors or to add namespaces declared on the root element to any invalid nodes.

However, I believe this would be a complicated change, and would probably not be relevant for Dataverse. Do correct me if I'm wrong however :)

pdurbin commented 1 year ago

@eduardorep @jfeio are you aware of other systems besides DOAJ that are out of compliance with the spec in this way? I'm wondering how common of a problem this is.

Are either of you interested in creating a pull request? (If so, before you start, I'd like to hear what @landreev and @poikilotherm think.)

eduardorep commented 1 year ago

Yes there is another one, ScieloBR: https://github.com/DOAJ/doaj/issues/2186#issuecomment-1476402391

Their website: https://www.scielo.br/

An example of a ListRecords response from that repository: https://oaipmh.scielo.org/br/oai?verb=ListRecords&metadataPrefix=oai_dc

Hope this helps!

poikilotherm commented 1 year ago

Hi @eduardorep and @jfeio !

I looked into this again today and put some thought into it. Dataverse does not always have this problem you describe, as we are not using the record parser in this project, but a custom one.

In the data provider, we had kind of a similar problem: we create some XML files already and wanted to "just include them" in the response. So maybe the same trick would be useful here, too? Would you benefit from using such a CopyElement that would simply transfer the content inside <metadata></metadata> unprocessed?

It would be part of the resulting Record's Metadata. From there, you could make it write to some String or whatever using an XmlWriter.

In terms of configuration when to go this or the other way, the Context you provide to the ServiceProvider can hold the information about your choices here.

poikilotherm commented 1 year ago

@eduardorep @jfeio please also feel free to join us on Zulip to discuss this less async. Here's an invite link, see you on the dev channel!

jfeio commented 1 year ago

Hi @poikilotherm! We actually ended up creating a fork of DSpace/xoai, and we adapted the record parser so that it detects whether any given element uses the "xsi" prefix without declaring its namespace; if so, the parser adds the missing declaration to the offending element before validating it.
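As a rough illustration of the idea (not the actual code from our fork), the check-and-patch step could look something like the sketch below. The class and method names are hypothetical, it works on the fragment as a plain string, it only looks at the first start tag, and it assumes attribute values contain no `>` character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XsiDeclarationPatcher {
    static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    // First start tag of the fragment: group 1 is the element name, group 2 its attributes.
    // Simplification: assumes no '>' inside attribute values.
    private static final Pattern START_TAG =
        Pattern.compile("<([A-Za-z_][^\\s/>]*)([^>]*)>");
    // An attribute written with the xsi prefix, e.g. xsi:schemaLocation="...".
    private static final Pattern USES_XSI = Pattern.compile("\\sxsi:[^\\s=]+\\s*=");
    // A declaration of the xsi prefix on the same element.
    private static final Pattern DECLARES_XSI = Pattern.compile("\\sxmlns:xsi\\s*=");

    /** Adds xmlns:xsi to the first element if it uses the prefix without declaring it. */
    static String addMissingXsiDeclaration(String fragment) {
        Matcher tag = START_TAG.matcher(fragment);
        if (!tag.find()) {
            return fragment; // no element found, nothing to patch
        }
        String attributes = tag.group(2);
        boolean usesXsi = USES_XSI.matcher(attributes).find();
        boolean declaresXsi = DECLARES_XSI.matcher(attributes).find();
        if (!usesXsi || declaresXsi) {
            return fragment; // prefix unused or already bound on this element
        }
        int afterName = tag.end(1);
        // Insert the declaration directly after the element name.
        return fragment.substring(0, afterName)
            + " xmlns:xsi=\"" + XSI_NS + "\""
            + fragment.substring(afterName);
    }

    public static void main(String[] args) {
        String record = "<oai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" "
            + "xsi:schemaLocation=\"http://www.openarchives.org/OAI/2.0/oai_dc/ "
            + "http://www.openarchives.org/OAI/2.0/oai_dc.xsd\"/>";
        System.out.println(addMissingXsiDeclaration(record));
    }
}
```

A more robust version would do this at the parser or event-stream level instead of with regular expressions, and would handle prefixes other than xsi, which is closer to the generic approach discussed above.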

This solution is not as generic as the solution you are implementing, but for our purposes, it works fine ;)

poikilotherm commented 1 year ago

Feel free to point me to your implementation or create a pull request. Always happy to add something like this. The fewer forks to maintain, the better.

pdurbin commented 1 year ago

@jfeio hi! I'm also curious about your implementation. Is the commit online?

poikilotherm commented 2 weeks ago

@eduardorep @jfeio you are still welcome to point us to your commits, so we can add the fix here as well. Even better: create a PR!

For the time being, I'll close this. Feel free to reopen.