IQSS / dataverse

Open source research data repository software
http://dataverse.org

XML declaration missing from OAI when using oai_ddi #10329

Closed plesubc closed 7 months ago

plesubc commented 8 months ago

XML declarations missing from metadataPrefix=oai_ddi records.

What steps does it take to reproduce the issue?

An OAI harvest on a record using oai_ddi. Generic example: https://[DV_URL]/oai?verb=GetRecord&metadataPrefix=oai_ddi&identifier=doi:10.5683/SUM/IDENT

On OAI harvest as above.

On every occurrence.

The record is missing the mandatory XML declaration required by section 3.2.1 of the OAI-PMH spec: https://www.openarchives.org/OAI/openarchivesprotocol.html

Because of this, records may trigger the error "XML Parsing Error: not well-formed" when they contain non-ASCII characters, which causes problems with OAI harvest.

This would (presumably) affect all records which contain characters outside of ISO-8859-1.

XML was expected to be generated without error (notably the DDI export found in the API and Dataverse GUI contains an XML declaration).

Which version of Dataverse are you using?

v5.13 (at https://borealisdata.ca)
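
A quick way to check a harvested response for the declaration is sketched below in Python. This is only an illustration: the helper name `starts_with_xml_declaration` is made up here, and the check looks for a literal `<?xml` prolog at the very start of the body (after an optional UTF-8 byte order mark) rather than doing a full XML parse.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def starts_with_xml_declaration(body: bytes) -> bool:
    """Return True if the response body opens with an XML declaration.

    Hypothetical helper for illustration; it only inspects the first bytes.
    """
    if body.startswith(UTF8_BOM):
        body = body[len(UTF8_BOM):]
    return body.startswith(b"<?xml ")


# Shape of the Dataverse response quoted in this issue (truncated).
dataverse_like = b'<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
# Shape of the GetRecord example in the OAI-PMH spec (truncated).
spec_like = b'<?xml version="1.0" encoding="UTF-8" ?>\n<OAI-PMH>'

print(starts_with_xml_declaration(dataverse_like))  # False
print(starts_with_xml_declaration(spec_like))       # True
```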


As an example of this, here is the output of https://borealisdata.ca/oai?verb=GetRecord&identifier=doi:10.5683/SP2/NEPRTA&metadataPrefix=oai_ddi (2024-02-16, probably repaired by the time you see it), original record at

https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP2/NEPRTA&version=1.0

XML Parsing Error: not well-formed
Location: https://borealisdata.ca/oai?verb=GetRecord&identifier=doi:10.5683/SP2/NEPRTA&metadataPrefix=oai_ddi
Line Number 1, Column 1622:

The character which causes the failure is the single typographic quote in the title: https://www.codetable.net/decimal/8217
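
For reference, a small Python sketch of how that character is encoded. Decimal 8217 is U+2019; the snippet uses only the standard library and shows that the character takes three bytes in UTF-8 and cannot be represented in ISO-8859-1.

```python
quote = chr(8217)  # U+2019 RIGHT SINGLE QUOTATION MARK

utf8_bytes = quote.encode("utf-8")
print(hex(ord(quote)))   # 0x2019
print(utf8_bytes)        # b'\xe2\x80\x99'
print(len(utf8_bytes))   # 3

# The same character has no ISO-8859-1 representation.
try:
    quote.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)         # False
```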

Note that the content of the page is as follows, and is missing the XML declaration:

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2024-02-16T19:15:58Z</responseDate><request verb="GetRecord" identifier="doi:10.5683/SP2/NEPRTA" metadataPrefix="oai_ddi">https://borealisdata.ca/oai</request><GetRecord><record><header><identifier>doi:10.5683/SP2/NEPRTA</identifier><datestamp>2023-02-02T07:00:48Z</datestamp><setSpec>SP</setSpec><setSpec>sp_dataverse</setSpec><setSpec>ubc_dataverse</setSpec></header><metadata><codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5"><docDscr><citation><titlStmt><titl>Data from: The ‘filtering’ metaphor revisited: competition and environment jointly structure invasibility and coexistence</titl><IDNo agency="DOI">doi:10.5683/SP2/NEPRTA</IDNo></titlStmt><distStmt><distrbtr source="archive">Borealis</distrbtr><distDate>2021-05-19</distDate></distStmt><verStmt source="archive"><version date="2021-05-19" type="RELEASED">1</version></verStmt><biblCit>Germain, Rachel M.; Mayfield, Margaret M.; Gilbert, Benjamin, 2018, "Data from: The ‘filtering’ metaphor revisited: competition and environment jointly structure invasibility and coexistence", https://doi.org/10.5683/SP2/NEPRTA, Borealis, V1, UNF:6:kxNp8p/4jEx8g19DieTKdA== [fileUNF]</biblCit></citation></docDscr><stdyDscr><citation><titlStmt><titl>Data from: The ‘filtering��� metaphor revisited: competition and environment jointly structure invasibility and coexistence</titl><IDNo agency="DOI">doi:10.5683/SP2/NEPRTA</IDNo><IDNo agency="Dryad">doi:10.5061/dryad.41752p5</IDNo></titlStmt><rspStmt><AuthEnty affiliation="University of British Columbia">Germain, Rachel M.</AuthEnty><AuthEnty affiliation="University of Queensland">Mayfield, Margaret 
M.</AuthEnty><AuthEnty affiliation="University of Toronto">Gilbert, Benjamin</AuthEnty></rspStmt><prodStmt/><distStmt><distrbtr source="archive">Borealis</distrbtr><contact>UBC Library Research Data Team</contact><distDate>2018-07-30</distDate><depDate>2020-06-30</depDate></distStmt><holdings URI="https://doi.org/10.5683/SP2/NEPRTA"/></citation><stdyInfo><subject><keyword xml:lang="en">Other</keyword><keyword vocab="Dryad">annual plants</keyword><keyword vocab="Dryad">fitness differences</keyword><keyword vocab="Dryad">Holocene</keyword></subject><abstract date="2020-06-30">&lt;b>Abstract&lt;/b>&lt;br/>‘Filtering’, or the reduction in species diversity that occurs because not all species can persist in all locations, is thought to unfold hierarchically, controlled by the environment at large scales and competition at small scales. However, the ecological effects of competition and the environment are not independent, and observational approaches preclude investigation into their interplay. We use a demographic approach with 30 plant species to experimentally test (i) the effect of competition on species persistence in two soil moisture environments, and (ii) the effect of environmental conditions on mechanisms underlying competitive coexistence. We find that competitors cause differential species persistence across environments even when effects are lacking in the absence of competition, and that the traits that determine persistence depend on the competitive environment. If our study had been observational and trait-based, we would have erroneously concluded that the environment filters species with low biomass, shallow roots, and small seeds. Changing environmental conditions generated idiosyncratic effects on coexistence outcomes, increasing competitive exclusion of some species while promoting coexistence of others. 
Our results highlight the importance of considering environmental filtering in light of, rather than in isolation from, competition, and challenge community assembly models and approaches to projecting future species distributions.</abstract><abstract date="2020-06-30">&lt;b>Usage notes&lt;/b>&lt;br />&lt;div class="o-metadata__file-usage-entry">&lt;h4 class="o-heading__level3-file-title">Germain BL data&lt;/h4>&lt;div class="o-metadata__file-description">First worksheet includes the demographic data, second worksheet the trait data. Species codes are expanded in the supplementary materials.&lt;/div>&lt;div class="o-metadata__file-name">&lt;/div>&lt;/div></abstract><sumDscr><geogCover>California</geogCover></sumDscr><notes>&lt;p>&lt;b>Dryad version number:&lt;/b> 1&lt;/p>
&lt;p>&lt;b>Version status:&lt;/b> submitted&lt;/p>
&lt;p>&lt;b>Dryad curation status:&lt;/b> Published&lt;/p>
&lt;p>&lt;b>Sharing link:&lt;/b> https://datadryad.org/stash/share/bEgp01tBpt-ctVM-ZfFa0KdOQT1nXE5FT-DnIRgymho&lt;/p>
&lt;p>&lt;b>Storage size:&lt;/b> 45413&lt;/p>
&lt;p>&lt;b>Visibility:&lt;/b> public&lt;/p></notes></stdyInfo><method><dataColl><sources/></dataColl><anlyInfo/></method><dataAccs><notes type="DVN:TOU" level="dv">This dataset is made available under a Creative Commons CC0 license with the following additional/modified terms and conditions: CC0 Waiver</notes><setAvail/><useStmt/></dataAccs><othrStdyMat><relPubl><citation><biblCit>Article</biblCit></citation><ExtLink URI="https://doi.org/10.1098/rsbl.2018.0460"/></relPubl></othrStdyMat></stdyDscr><otherMat ID="f153762" URI="https://borealisdata.ca/api/access/datafile/153762" level="datafile"><labl>dryad_41752p5.json</labl><txt>Original JSON from Dryad</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/plain;charset=UTF-8</notes></otherMat><otherMat ID="f153761" URI="https://borealisdata.ca/api/access/datafile/153761" level="datafile"><labl>Germain BL data.tab</labl><txt>First worksheet includes the demographic data, second worksheet the trait data. Species codes are expanded in the supplementary materials.</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/tab-separated-values</notes></otherMat></codeBook></metadata></record></GetRecord></OAI-PMH>
landreev commented 7 months ago

I may be missing something obvious, but why do you think that the problem above is due to a missing xml declaration? Oh, I see what you mean now [edit: no, that's not what the OP meant] - the xml declaration that is present in the export, but absent in the OAI output. This is in fact "a feature, not a bug": these declarations are stripped on purpose when generating the OAI output. That xml header would in fact make the OAI xml invalid if left in place.

To me it looks like the "XML Parsing Error" in your example is due to the invalid UTF8 characters in the output (after the word "filtering"). I also suspect that it's a result of this bug: https://github.com/IQSS/dataverse/issues/9910 which has since been fixed (in 6.1; can be patched in a previous Dataverse version by dropping the updated OAI library jar in place).

But please note that this is just a guess, we would need to confirm this.

"probably repaired by the time you see it"

Any chance you could point me to an OAI record that is still similarly broken?

plesubc commented 7 months ago

OAI output requires the declaration as cited above in 3.2.1 of the spec.

The first tag output is an XML declaration where the version is always 1.0 and the encoding is always UTF-8, eg: <?xml version="1.0" encoding="UTF-8" ?>

For example, this is missing the declaration: https://borealisdata.ca/oai?verb=GetRecord&metadataPrefix=oai_ddi&identifier=doi:10.5683/SP2/NEPRTA

It is also the record that caused the chaos: https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP2/NEPRTA More specifically, Version 1 of the record contains the "right single quotation mark" which caused consternation.

I can't point you to a similar record that's broken, but you should be able to reproduce it by copying over the metadata from version 1 to wherever you test things: https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP2/NEPRTA&version=1.0

I'm not sure how you get "That xml header would in fact make the OAI xml invalid if left in place." Using the GetRecord verb and having the declaration in place should not invalidate the XML, unless I'm missing something.

"An example of a successful reply to the GetRecord request shown above is of the form:"

<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" 
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
         http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
 <responseDate>2002-05-01T19:20:30Z</responseDate>
 <request verb="GetRecord" identifier="oai:arXiv.org:hep-th/9901001"
          metadataPrefix="oai_dc">http://an.oa.org/OAI-script</request> 
 <GetRecord>
  <record>
      ...
  </record>
 </GetRecord> 
</OAI-PMH>  

The GetRecord response from Dataverse does not conform to this model.

I didn't write whatever parsed the XML output from Borealisdata.ca. But I did have to find out what was causing the problem in the record, which I then traced to the offending UTF-8 character.

I realize that XML should explicitly be assumed to be UTF-8, but:

As far as I can tell, inserting one line whenever the GetRecord OAI verb is used should be enough to make it conform to the spec. Oh and it's also missing from ListRecords, and possibly from other places although I haven't made an exhaustive search.

landreev commented 7 months ago

"OAI output requires the declaration as cited above in 3.2.1 of the spec. The first tag output is an XML declaration where the version is always 1.0 and the encoding is always UTF-8, eg: <?xml version="1.0" encoding="UTF-8" ?>"

The sentence quoted above refers to the xml declaration at the top of the full GetRecord XML output itself... So, we were talking about different things (I was referring to our code going to some trouble stripping these headers somewhere else). But, do note that I opened with an acknowledgment that it was possible I was missing something.

However, having taken another look, I can tell you 100% for sure that the xml error in your original example is most definitely the result of the bug I mentioned (#9910). I can send you more info about that bug; and I will otherwise look into this some more tomorrow.

My apologies for having missed this issue when you opened it last month.

landreev commented 7 months ago

Sorry for adding unnecessary confusion the other day. I can point you to the specific place where that xml declaration is in fact stripped from the output (inside the <metadata>...</metadata> blocks), but no, that's not relevant to the case at hand.

There are two separate things going on:

  1. You appear to be entirely correct about the OAI-PMH spec requiring the xml declaration. Somehow nobody has noticed this over the years in our OAI output. I kept saying "our code" but, strictly speaking, this output is generated in a third-party library (xoai). But it is now maintained by a member of the Dataverse core team and I'll be talking to them about this. We can usually make any changes there and incorporate them into Dataverse fairly quickly.
  2. The absence of this header is NOT what's causing the problem presented in the opening comment. Note that the borealisdata.ca Dataverse instance almost certainly has numerous other metadata fragments with UTF8 characters, including that specific non-ASCII quote. It occurs in several other places in the record in your example, where it is displayed properly, and the OAI records produced for most of those fragments are perfectly fine, well-formed and parsable. What makes the specific record in your example not well-formed is the presence of junk bytes: binary characters not forming valid UTF8 sequences. In the quoted output they are turned into UTF8 "invalid character" symbols: ...filtering��� metaphor .... I understand that it was reasonable to assume that it was the other way around, that the absence of the xml declaration turned a UTF8 character into invalid bytes, but no, that is not the case. The telltale sign that this problem is an instance of the peculiar bug I mentioned (#9910) is that the garbage sequence occurs at precisely the 1024-byte offset in the metadata fragment:
    echo -n '<codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5"><docDscr><citation><titlStmt><titl>Data from: The ‘filtering’ metaphor revisited: competition and environment jointly structure invasibility and coexistence</titl><IDNo agency="DOI">doi:10.5683/SP2/NEPRTA</IDNo></titlStmt><distStmt><distrbtr source="archive">Borealis</distrbtr><distDate>2021-05-19</distDate></distStmt><verStmt source="archive"><version date="2021-05-19" type="RELEASED">1</version></verStmt><biblCit>Germain, Rachel M.; Mayfield, Margaret M.; Gilbert, Benjamin, 2018, "Data from: The ‘filtering’ metaphor revisited: competition and environment jointly structure invasibility and coexistence", https://doi.org/10.5683/SP2/NEPRTA, Borealis, V1, UNF:6:kxNp8p/4jEx8g19DieTKdA== [fileUNF]</biblCit></citation></docDscr><stdyDscr><citation><titlStmt><titl>Data from: The ‘filtering�'  |  wc
       0      54    1026
    echo -n '<codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5"><docDscr><citation><titlStmt><titl>Data from: The ‘filtering’ metaphor revisited: competition and environment jointly structure invasibility and coexistence</titl><IDNo agency="DOI">doi:10.5683/SP2/NEPRTA</IDNo></titlStmt><distStmt><distrbtr source="archive">Borealis</distrbtr><distDate>2021-05-19</distDate></distStmt><verStmt source="archive"><version date="2021-05-19" type="RELEASED">1</version></verStmt><biblCit>Germain, Rachel M.; Mayfield, Margaret M.; Gilbert, Benjamin, 2018, "Data from: The ‘filtering’ metaphor revisited: competition and environment jointly structure invasibility and coexistence", https://doi.org/10.5683/SP2/NEPRTA, Borealis, V1, UNF:6:kxNp8p/4jEx8g19DieTKdA== [fileUNF]</biblCit></citation></docDscr><stdyDscr><citation><titlStmt><titl>Data from: The ‘filtering' | wc 
       0      54    1023

    That was the nature of the weird bug: it manifested itself when a multi-byte UTF8 sequence happened to straddle the 1024-byte offset in the cached metadata record (and only that offset, not its multiples). If you are curious/have more time to kill, here's a long description of the bug in the xoai repo: https://github.com/gdcc/xoai/issues/188 It was fixed there and the updated library was incorporated into Dataverse in PR #10012. Please see specifically this comment in issue 9910 on how to patch a pre-6.1 instance of Dataverse for this bug: https://github.com/IQSS/dataverse/issues/9910#issuecomment-1769203311
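
The general failure mode can be reproduced outside Dataverse. The Python sketch below is not the xoai code. It only assumes that a byte stream is cut at offset 1024 and that each piece is decoded on its own, which is one way a multi-byte character can turn into junk. The 1023-byte filler is a stand-in for the real metadata that precedes the quote.

```python
import codecs

CHUNK = 1024

# 1023 bytes of filler (stand-in for the metadata before the quote),
# then U+2019, whose three UTF-8 bytes straddle offset 1024.
data = (("x" * 1023) + "\u2019 metaphor").encode("utf-8")
first, rest = data[:CHUNK], data[CHUNK:]

# Naive: decode each piece independently. The split sequence is lost
# and every orphaned byte becomes U+FFFD, the "invalid character" symbol.
naive = first.decode("utf-8", errors="replace") + rest.decode("utf-8", errors="replace")

# Safer: an incremental decoder carries the partial sequence across calls.
decoder = codecs.getincrementaldecoder("utf-8")()
careful = decoder.decode(first) + decoder.decode(rest, final=True)

print(naive.count("\ufffd"))   # 3
print("\u2019" in careful)     # True
```

The three replacement characters in the naive result correspond to the three junk symbols seen after "filtering" in the broken record.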

All the best, -Leo

landreev commented 7 months ago

I opened an issue in the xoai repo (gdcc/xoai#225) for the missing declaration. It seems somewhat redundant, since the server already sends the Content-Type: text/xml;charset=UTF-8 header to the client. But the spec does say it's needed, so I trust the maintainer of the library to make the decision as to whether it's necessary.
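
To make the redundancy point concrete, here is a small Python sketch of how a client can read the charset from that header using the standard library. The header value is the one quoted above. How any particular harvester actually settles on an encoding is an assumption here.

```python
from email.message import Message

header_value = "text/xml;charset=UTF-8"  # value quoted in this comment

# Message is used only as a convenient parser for the header parameters.
msg = Message()
msg["Content-Type"] = header_value

print(msg.get_content_type())     # text/xml
print(msg.get_content_charset())  # utf-8
```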

Otherwise I'm going to close this issue.

Once again, I am really sorry we didn't get back to you sooner on this. We communicated directly with a couple of other Dataverse instances who reported the bug last fall and helped them patch their installations. But then once it was fixed in 6.1 we just moved on, assuming that everybody would just upgrade - I'm realizing now that was a mistaken assumption.

poikilotherm commented 6 months ago

I don't think it's the library's job to take care of the prolog. IMHO we'd need to make this change in Dataverse code.

See also my reply at https://github.com/gdcc/xoai/issues/225#issuecomment-2017703891