IQSS / dataverse

Open source research data repository software
http://dataverse.org

oai_dc export fails, OAI-PMH error #8306

Open luopc-top opened 2 years ago

luopc-top commented 2 years ago

Hi, I found that the OAI-PMH API returns malformed XML for the URL https://dataverse.harvard.edu/oai?verb=ListRecords&resumptionToken=MTo1NTAwfDI6fDM6fDQ6fDU6b2FpX2Rj (see the figure below). [image]

madryk commented 2 years ago

There is probably an error in the oai_dc export when retrieving the dataset doi:10.7910/DVN/3FPJEA:

https://dataverse.harvard.edu/oai?verb=GetRecord&identifier=doi:10.7910/DVN/3FPJEA&metadataPrefix=oai_dc

The returned XML is cut off right after the opening <metadata> tag.
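
One way to confirm a truncation like this without reading the response by eye is to check whether the body parses as XML. A minimal Python sketch, assuming the response text has already been fetched; the helper name `is_well_formed` and both sample strings are illustrative, not actual Dataverse output:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a complete XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Illustrative samples only, not real server responses.
complete = "<record><header/><metadata><title>ok</title></metadata></record>"
truncated = "<record><header/><metadata>"

print(is_well_formed(complete))   # True
print(is_well_formed(truncated))  # False
```

A response that stops after `<metadata>` leaves elements unclosed, so the parser rejects it.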

pdurbin commented 2 years ago

@madryk good catch, if I go to https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3FPJEA and try various export formats like...

... I get "Export Failed".

If I download the dataset as JSON I see that the title has Unicode Character 'END OF TEXT' (U+0003) at the end:

[Screenshot, 2021-12-13]

But is this the only dataset causing problems for @luopc-top ?
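
To check other datasets for the same kind of problem, one option is to scan metadata strings for characters that XML 1.0 cannot carry. The sketch below is in Python and uses the `Char` production from the XML 1.0 specification; the function name `find_illegal_xml_chars` and the sample title are hypothetical, and nothing here is taken from the Dataverse code base:

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
ILLEGAL_XML_CHARS = re.compile(
    r"[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for each character XML 1.0 does not allow."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in ILLEGAL_XML_CHARS.finditer(text)
    ]


# Hypothetical title ending in END OF TEXT (U+0003).
title = "Example title\u0003"
print(find_illegal_xml_chars(title))  # [(13, 'U+0003')]
```

Running a check like this over the title and other text fields of each dataset's JSON export would show whether more records are affected.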

luopc-top commented 2 years ago

I only found this by accident; I haven't tried other datasets.

luopc-top commented 2 years ago

I also found that the exported JSON for dataset doi:10.7910/DVN/3FPJEA has some errors. See https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.7910/DVN/29779

pdurbin commented 1 year ago

@luopc-top @madryk I'm curious if you've experimented with this any further, especially with the Unicode character 'END OF TEXT' (U+0003) thing I mentioned.

madryk commented 1 year ago

I didn't explore it further. The only thing I can add is that I ran into a similar issue in a non-Dataverse project. Somehow a U+0002 character (https://www.fileformat.info/info/unicode/char/0002/index.htm) ended up in our data, which also caused problems when generating XML. What we did to fix it:

The algorithm that removes unwanted characters was based on https://stackoverflow.com/questions/6198986/how-can-i-replace-non-printable-unicode-characters-in-java (answer https://stackoverflow.com/a/18603020), but instead of replacing them with a ? character we simply remove them, and we leave newline and tab characters (\n, \t) intact.
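
A rough sketch of that kind of filter, written in Python rather than Java for brevity. It assumes "unwanted" means the Unicode general categories for control, format, private-use, surrogate, and unassigned characters; that category set and the names `UNWANTED_CATEGORIES`, `KEEP`, and `strip_unwanted` are assumptions for illustration, not the code from the project mentioned above:

```python
import unicodedata

# Assumed "non-printable" categories: control (Cc), format (Cf),
# private use (Co), surrogate (Cs), unassigned (Cn).
UNWANTED_CATEGORIES = {"Cc", "Cf", "Co", "Cs", "Cn"}

# Characters kept even though they are control characters.
KEEP = {"\n", "\t"}


def strip_unwanted(text: str) -> str:
    """Drop unwanted characters instead of replacing them with '?'."""
    return "".join(
        ch
        for ch in text
        if ch in KEEP or unicodedata.category(ch) not in UNWANTED_CATEGORIES
    )


cleaned = strip_unwanted("Title\u0003 with\ttab\nand newline\u0002")
print(repr(cleaned))  # 'Title with\ttab\nand newline'
```

Note that with this category set a carriage return (\r) is also removed, since only \n and \t are whitelisted; add it to `KEEP` if it should survive.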