Closed by plesubc 7 months ago
I may be missing something obvious, but why do you think that the problem above is due to a missing xml declaration? Oh, I see what you mean now [edit: no, that's not what the OP meant] - the xml declaration that is present in the export, but absent in the OAI output. This is in fact "a feature, not a bug": these declarations are stripped on purpose when generating the OAI output. That xml header would in fact make the OAI xml invalid if left in place.
To me it looks like the "XML Parsing Error" in your example is due to the invalid UTF8 characters in the output (after the word "filtering"). I also suspect that it's a result of this bug: https://github.com/IQSS/dataverse/issues/9910 which has since been fixed (in 6.1; can be patched in a previous Dataverse version by dropping the updated OAI library jar in place).
But please note that this is just a guess, we would need to confirm this.
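One way to test that guess locally is to feed a parser both variants: valid UTF-8 with no declaration, and a declaration followed by a damaged multi-byte sequence. This is a minimal sketch using Python's standard library; the sample titles and the helper name `parses` are made up for illustration and are not taken from the actual OAI output.

```python
import xml.etree.ElementTree as ET


def parses(document: bytes) -> bool:
    """Return True if the bytes are well-formed XML, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Valid UTF-8 and no XML declaration: the parser falls back to UTF-8.
valid_no_decl = "<titl>The \u2018filtering\u2019 metaphor</titl>".encode("utf-8")

# Declaration present, but the closing quote's 3-byte sequence is cut short
# (hypothetical damage, only meant to resemble the garbage in the record).
broken_with_decl = (
    b'<?xml version="1.0" encoding="UTF-8" ?>'
    b"<titl>The \xe2\x80\x98filtering\xe2\x80 metaphor</titl>"
)

print("valid UTF-8, no declaration:", parses(valid_no_decl))
print("invalid bytes, with declaration:", parses(broken_with_decl))
```

If the first case parses and the second does not, the declaration is not what decides well-formedness here; the bytes are.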
probably repaired by the time you see it
Any chance you could point me to an OAI record that is still similarly broken?
OAI output requires the declaration as cited above in 3.2.1 of the spec.
The first tag output is an XML declaration where the version is always 1.0 and the encoding is always UTF-8, eg: <?xml version="1.0" encoding="UTF-8" ?>
For example, this is missing the declaration: https://borealisdata.ca/oai?verb=GetRecord&metadataPrefix=oai_ddi&identifier=doi:10.5683/SP2/NEPRTA
It is also the record that caused the chaos: https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP2/NEPRTA More specifically, Version 1 of the record contains the "right single quotation mark" which caused consternation.
I can't point you to a similar record that's broken, but you should be able to reproduce it by copying over the metadata from version 1 to wherever you test things: https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP2/NEPRTA&version=1.0
I'm not sure how you get "That xml header would in fact make the OAI xml invalid if left in place." Using the GetRecord verb and having the declaration in place should not invalidate the XML, unless I'm missing something.
"An example of a successful reply to the GetRecord request shown above is of the form:"
<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2002-05-01T19:20:30Z</responseDate>
<request verb="GetRecord" identifier="oai:arXiv.org:hep-th/9901001"
metadataPrefix="oai_dc">http://an.oa.org/OAI-script</request>
<GetRecord>
<record>
...
</record>
</GetRecord>
</OAI-PMH>
The GetRecord response from Dataverse does not conform to this model.
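For anyone who wants to check a response for this, here is a rough sketch of the test, assuming the response body is already in hand as bytes. The function name `has_xml_declaration` and the two sample bodies are illustrative only; no request is made to any server.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def has_xml_declaration(body: bytes) -> bool:
    """Check whether a response body opens with an XML declaration.

    The declaration must be the very first thing in the document, so only
    an optional byte order mark is skipped before looking for it.
    """
    if body.startswith(UTF8_BOM):
        body = body[len(UTF8_BOM):]
    # "<?xml" must be followed by whitespace, otherwise it could be some
    # other processing instruction such as "<?xml-stylesheet ...?>".
    return body.startswith(b"<?xml") and body[5:6].isspace()


# Shaped like the example in the spec.
with_decl = (
    b'<?xml version="1.0" encoding="UTF-8" ?>\n'
    b'<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"></OAI-PMH>'
)

# Shaped like a response that starts directly with the root element.
without_decl = b'<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"></OAI-PMH>'

print(has_xml_declaration(with_decl))
print(has_xml_declaration(without_decl))
```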
I didn't write whatever parsed the XML output from Borealisdata.ca. But I did have to find out what was causing the problem in the record, which I then traced to the offending UTF-8 character.
I realize that XML should explicitly be assumed to be UTF-8, but:
As far as I can tell, inserting one line whenever the GetRecord OAI verb is used should be enough to make it conform to the spec. Oh and it's also missing from ListRecords, and possibly from other places although I haven't made an exhaustive search.
OAI output requires the declaration as cited above in 3.2.1 of the spec. The first tag output is an XML declaration where the version is always 1.0 and the encoding is always UTF-8, eg: <?xml version="1.0" encoding="UTF-8" ?>
The sentence quoted above refers to the xml declaration at the top of the full GetRecord XML output itself... So, we were talking about different things (I was referring to our code going to some trouble stripping these headers somewhere else). But, do note that I opened with an acknowledgment that it was possible I was missing something.
However, having taken another look, I can tell you 100% for sure that the xml error in your original example is most definitely the result of the bug I mentioned (#9910). I can send you more info about that bug; and I will otherwise look into this some more tomorrow.
My apologies for having missed this issue when you opened it last month.
Sorry for adding unnecessary confusion the other day. I can point you to the specific place where that xml declaration is in fact stripped from the output (inside the <metadata>...</metadata> blocks), but no, that's not relevant to the case at hand.
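For context on why a declaration has to be stripped inside those blocks: an XML declaration is only allowed at the very start of a document, so an exported record embedded as-is would make the enclosing response not well-formed. The sketch below illustrates the idea only. It is not the Dataverse or xoai code, and `strip_declaration`, `wrap_in_metadata` and the tiny `codeBook` fragment are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches a leading XML declaration plus surrounding whitespace.
LEADING_DECL = re.compile(r"^\s*<\?xml[^>]*\?>\s*")


def strip_declaration(fragment: str) -> str:
    """Remove a leading XML declaration from an exported record."""
    return LEADING_DECL.sub("", fragment, count=1)


def wrap_in_metadata(fragment: str) -> str:
    """Embed a record in a <metadata> element, as an OAI response would."""
    return "<metadata>" + fragment + "</metadata>"


def well_formed(text: str) -> bool:
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Stand-in for a full export that carries its own declaration.
export = '<?xml version="1.0" encoding="UTF-8"?><codeBook version="2.5"/>'

print(well_formed(wrap_in_metadata(export)))
print(well_formed(wrap_in_metadata(strip_declaration(export))))
```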
There are two separate things going on:
...filtering��� metaphor ...
I understand that it was reasonable to assume that it was the other way around (that the absence of the xml declaration turned a UTF8 character into invalid bytes), but no, that is not the case. The telltale sign that this problem is an instance of the peculiar bug I mentioned (#9910) is that the garbage sequence occurs at precisely the 1024-byte offset in the metadata fragment:
echo -n '<codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5"><docDscr><citation><titlStmt><titl>Data from: The ‘filtering’ metaphor revisited: competition and environment jointly structure invasibility and coexistence</titl><IDNo agency="DOI">doi:10.5683/SP2/NEPRTA</IDNo></titlStmt><distStmt><distrbtr source="archive">Borealis</distrbtr><distDate>2021-05-19</distDate></distStmt><verStmt source="archive"><version date="2021-05-19" type="RELEASED">1</version></verStmt><biblCit>Germain, Rachel M.; Mayfield, Margaret M.; Gilbert, Benjamin, 2018, "Data from: The ‘filtering’ metaphor revisited: competition and environment jointly structure invasibility and coexistence", https://doi.org/10.5683/SP2/NEPRTA, Borealis, V1, UNF:6:kxNp8p/4jEx8g19DieTKdA== [fileUNF]</biblCit></citation></docDscr><stdyDscr><citation><titlStmt><titl>Data from: The ‘filtering�' | wc
0 54 1026
echo -n '<codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5"><docDscr><citation><titlStmt><titl>Data from: The ‘filtering’ metaphor revisited: competition and environment jointly structure invasibility and coexistence</titl><IDNo agency="DOI">doi:10.5683/SP2/NEPRTA</IDNo></titlStmt><distStmt><distrbtr source="archive">Borealis</distrbtr><distDate>2021-05-19</distDate></distStmt><verStmt source="archive"><version date="2021-05-19" type="RELEASED">1</version></verStmt><biblCit>Germain, Rachel M.; Mayfield, Margaret M.; Gilbert, Benjamin, 2018, "Data from: The ‘filtering’ metaphor revisited: competition and environment jointly structure invasibility and coexistence", https://doi.org/10.5683/SP2/NEPRTA, Borealis, V1, UNF:6:kxNp8p/4jEx8g19DieTKdA== [fileUNF]</biblCit></citation></docDscr><stdyDscr><citation><titlStmt><titl>Data from: The ‘filtering' | wc
0 54 1023
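The same check can be scripted for any metadata fragment. This is a small sketch, assuming the fragment is available as a decoded string; `chars_straddling` is a made-up helper name and the padded sample stands in for the real 1023-byte prefix.

```python
def chars_straddling(text: str, offset: int = 1024):
    """List multi-byte characters whose UTF-8 bytes span the given offset.

    Returns (character, first_byte_index, end_byte_index) tuples, where the
    character occupies bytes first_byte_index .. end_byte_index - 1.
    """
    hits = []
    position = 0
    for ch in text:
        width = len(ch.encode("utf-8"))
        # Some bytes fall before the offset and some at or after it.
        if position < offset < position + width:
            hits.append((ch, position, position + width))
        position += width
    return hits


# 1023 single-byte characters, then a right single quotation mark (3 bytes),
# mirroring the byte counts reported by wc above.
sample = "a" * 1023 + "\u2019" + " metaphor"
print(chars_straddling(sample))

# One byte further along, the quote starts exactly at the offset and is safe.
print(chars_straddling("a" * 1024 + "\u2019"))
```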
That was the nature of the weird bug: it manifested itself when a multi-byte UTF8 sequence happened to straddle the 1024-byte offset in the cached metadata record (and only that offset, not its multiples). If you are curious or have more time to kill, here's a long description of the bug in the xoai repo: https://github.com/gdcc/xoai/issues/188 It was fixed there, and the updated library was incorporated into Dataverse in PR #10012. Please see specifically this comment in issue #9910 on how to patch a pre-6.1 instance of Dataverse for this bug: https://github.com/IQSS/dataverse/issues/9910#issuecomment-1769203311
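To give a feel for this class of bug, here is a simplified illustration of what can happen when a byte stream is decoded in fixed-size pieces. It is not the xoai code (see the linked issue for the real cause), and unlike the actual bug this naive version would break at every multiple of the chunk size, not only at the first boundary. Both function names are invented for the example.

```python
import codecs


def decode_chunks_naively(data: bytes, size: int = 1024) -> str:
    """Decode each chunk on its own; a split character turns into U+FFFD."""
    return "".join(
        data[i:i + size].decode("utf-8", errors="replace")
        for i in range(0, len(data), size)
    )


def decode_chunks_incrementally(data: bytes, size: int = 1024) -> str:
    """Decode chunk by chunk while carrying incomplete sequences over."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(data[i:i + size]) for i in range(0, len(data), size)]
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# A 3-byte quotation mark starting at byte index 1023, so it spans two chunks.
text = "a" * 1023 + "\u2019" + " metaphor"
data = text.encode("utf-8")

print("\ufffd" in decode_chunks_naively(data))
print(decode_chunks_incrementally(data) == text)
```

An incremental decoder holds on to the trailing bytes of an unfinished sequence until the next chunk arrives, which is the general remedy for this kind of boundary problem.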
All the best, -Leo
I opened an issue in the xoai repo (gdcc/xoai#225) for the missing declaration. It seems somewhat redundant, since the server already sends the Content-Type: text/xml;charset=UTF-8 header to the client. But the spec does say it's needed, so I trust the maintainer of the library to make the decision as to whether it's necessary.
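To show why the header makes the declaration look redundant to a client, here is a rough sketch of how a harvester might settle on an encoding. The precedence used (HTTP charset first, then the declaration, then UTF-8) is an assumption for this example, and the function names are made up; real clients may behave differently.

```python
import re
from email.message import Message
from typing import Optional

# Captures the encoding name from a leading XML declaration, if there is one.
DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def charset_from_header(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()  # lower-cased, or None if absent


def charset_from_declaration(body: bytes) -> Optional[str]:
    """Extract the encoding named in a leading XML declaration, if any."""
    match = DECL_ENCODING.match(body)
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type: str, body: bytes) -> str:
    """Assumed precedence: header, then declaration, then the UTF-8 default."""
    return (
        charset_from_header(content_type)
        or charset_from_declaration(body)
        or "utf-8"
    )


print(effective_charset("text/xml;charset=UTF-8", b"<OAI-PMH/>"))
print(effective_charset("text/xml", b'<?xml version="1.0" encoding="UTF-8" ?><OAI-PMH/>'))
```

Under these assumptions, a client gets UTF-8 either way for the responses discussed here, with or without the declaration.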
Otherwise I'm going to close this issue.
Once again, I am really sorry we didn't get back to you sooner on this. We communicated directly with a couple of other Dataverse instances who reported the bug last fall and helped them patch their installations. But then once it was fixed in 6.1 we just moved on, assuming that everybody would just upgrade - I'm realizing now that was a mistaken assumption.
I don't think it's the library's job to take care of the prolog. IMHO we'd need to make this change in Dataverse code.
See also my reply at https://github.com/gdcc/xoai/issues/225#issuecomment-2017703891
XML declarations missing from metadataPrefix=oai_ddi records.
What steps does it take to reproduce the issue?
An OAI harvest on a record using oai_ddi. Generic example: https://[DV_URL]/oai?verb=GetRecord&metadataPrefix=oai_ddi&identifier=doi:10.5683/SUM/IDENT
On OAI harvest as above.
On every occurrence.
Record is missing mandatory xml declaration as in the OAI spec section 3.2.1 as per https://www.openarchives.org/OAI/openarchivesprotocol.html
Because of this, records may cause an "XML Parsing Error: not well-formed" error when encountering non-ASCII characters, causing problems with OAI harvest.
This would (presumably) affect all records which contain characters outside of ISO-8859-1.
XML was expected to be generated without error (notably the DDI export found in the API and Dataverse GUI contains an XML declaration).
Which version of Dataverse are you using?
v5.13 (at https://borealisdata.ca)
As an example of this, here is the output of https://borealisdata.ca/oai?verb=GetRecord&identifier=doi:10.5683/SP2/NEPRTA&metadataPrefix=oai_ddi (2024-02-16, probably repaired by the time you see it), original record at
https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP2/NEPRTA&version=1.0
The character which causes the failure is the single typographic quote in the title: https://www.codetable.net/decimal/8217
Note that the content of the page is as follows, and is missing the XML declaration: