oai_dc export fails, OAI-PMH error

IQSS / dataverse

Open source research data repository software

http://dataverse.org

oai_dc export fails, OAI-PMH error #8306

Open luopc-top opened 2 years ago

luopc-top commented 2 years ago

Hi, I found that the OAI-PMH API returns erroneous XML for the URL https://dataverse.harvard.edu/oai?verb=ListRecords&resumptionToken=MTo1NTAwfDI6fDM6fDQ6fDU6b2FpX2Rj (see the figure below).

madryk commented 2 years ago

There is probably an error in the oai_dc export when retrieving the dataset with DOI doi:10.7910/DVN/3FPJEA:

https://dataverse.harvard.edu/oai?verb=GetRecord&identifier=doi:10.7910/DVN/3FPJEA&metadataPrefix=oai_dc

The returned XML is cut off right after the opening <metadata> tag.

pdurbin commented 2 years ago

@madryk good catch, if I go to https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3FPJEA and try various export formats like...

Dublin Core
DDI
DDI HTML Codebook
OpenAIRE

... I get "Export Failed".

If I download the dataset as JSON I see that the title has Unicode Character 'END OF TEXT' (U+0003) at the end:

[Screenshot: Screen Shot 2021-12-13 at 8 50 37 AM]

But is this the only dataset causing problems for @luopc-top ?
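For anyone who wants to check other titles for the same problem, here is a minimal sketch in Java. It assumes the export failure comes from characters that XML 1.0 does not allow (the "Char" production excludes U+0003 and most other C0 control characters); the thread has not confirmed this as the root cause. The class name, method names, and sample title are hypothetical and are not taken from the Dataverse code base.

```java
// Hypothetical helper, not part of Dataverse: finds characters that
// XML 1.0 does not allow in a document.
final class XmlCharCheck {
    private XmlCharCheck() {}

    /** Returns true if the code point matches the XML 1.0 "Char" production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the UTF-16 index of the first illegal code point, or -1 if there is none. */
    static int firstIllegalIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up title ending in END OF TEXT (U+0003), mimicking the case above.
        String title = "Example title" + (char) 0x03;
        int idx = firstIllegalIndex(title);
        if (idx >= 0) {
            System.out.printf("illegal character U+%04X at index %d%n",
                    (int) title.charAt(idx), idx);
        }
    }
}
```

Running a check like this over the title and other free-text fields of each dataset would show whether doi:10.7910/DVN/3FPJEA is the only affected record.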

luopc-top commented 2 years ago

I just found this by accident; I haven't tried other datasets.

luopc-top commented 2 years ago

I also found that the exported JSON for dataset doi:10.7910/DVN/3FPJEA has some errors. See https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.7910/DVN/29779

pdurbin commented 1 year ago

@luopc-top @madryk I'm curious if you've experimented with this any further, especially with the Unicode character 'END OF TEXT' (U+0003) thing I mentioned.

madryk commented 1 year ago

I didn't explore it further. The only thing I can add is that I ran into a similar issue in a non-Dataverse project. Somehow we ended up with the https://www.fileformat.info/info/unicode/char/0002/index.htm character (U+0002, START OF TEXT) in our data, which also caused problems when generating XML. What we did to fix it:

We wrote a Flyway Java migration that removes the unwanted characters.
For every string that could be provided by a user or imported from an external source, we run an algorithm that removes unwanted characters.

The algorithm that removes unwanted characters was based on https://stackoverflow.com/questions/6198986/how-can-i-replace-non-printable-unicode-characters-in-java (answer https://stackoverflow.com/a/18603020), but instead of replacing them with a ? character we simply remove them, and we leave the newline and tab characters (\n, \t) intact.
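A rough sketch of what such a removal routine could look like in Java is below. It is a reconstruction under assumptions, not the code from that project: the class and method names are hypothetical, and the set of Unicode general categories that get dropped (control, format, private use, surrogate, unassigned) is an assumption modelled on the category-based approach in the linked answer.

```java
// Hypothetical sanitizer: removes non-printable code points from a string
// but keeps newline and tab.
final class ControlCharStripper {
    private ControlCharStripper() {}

    static String strip(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        for (int offset = 0; offset < input.length(); ) {
            int cp = input.codePointAt(offset);
            offset += Character.charCount(cp);

            // Newline and tab are control characters, but they are kept on purpose.
            if (cp == '\n' || cp == '\t') {
                out.appendCodePoint(cp);
                continue;
            }
            switch (Character.getType(cp)) {
                case Character.CONTROL:
                case Character.FORMAT:
                case Character.PRIVATE_USE:
                case Character.SURROGATE:
                case Character.UNASSIGNED:
                    // Assumed "unwanted" categories: drop instead of replacing with '?'.
                    break;
                default:
                    out.appendCodePoint(cp);
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up title ending in END OF TEXT (U+0003).
        String dirty = "Example title" + (char) 0x03;
        System.out.println("[" + strip(dirty) + "]");
    }
}
```

Note that in this sketch carriage returns are removed as well, and dropping the format category also strips characters such as zero-width joiners, which can change how some scripts and emoji sequences render. A narrower filter that removes only characters outside the XML 1.0 character range may be a safer choice for metadata fields.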