Open luopc-top opened 2 years ago
There is probably an error in the oai_dc format when retrieving the dataset with DOI doi:10.7910/DVN/3FPJEA: the returned XML is cut off right after the opening <metadata> tag.
@madryk good catch, if I go to https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3FPJEA and try various export formats like...
... I get "Export Failed".
If I download the dataset as JSON I see that the title has Unicode Character 'END OF TEXT' (U+0003) at the end:
But is this the only dataset causing problems for @luopc-top ?
I just found it by accident; I haven't tried other datasets.
I also found that the exported JSON data for dataset doi:10.7910/DVN/3FPJEA has some errors. See https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.7910/DVN/29779
@luopc-top @madryk I'm curious if you've experimented with this any further, especially with the Unicode character 'END OF TEXT' (U+0003) thing I mentioned.
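For anyone who wants to check other titles for the same problem: U+0003 falls outside the character range that XML 1.0 allows, which may explain why the XML-based exports fail. Below is a minimal Java sketch that locates the first such character in a string. The class and method names (`XmlCharCheck`, `isLegalXml10`, `firstIllegalIndex`) are made up for illustration and are not part of Dataverse.

```java
public final class XmlCharCheck {
    private XmlCharCheck() {}

    /** True if the code point is permitted by the XML 1.0 "Char" production. */
    public static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first character not legal in XML 1.0, or -1 if none. */
    public static int firstIllegalIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isLegalXml10(cp)) {
                return i;
            }
            // Step over both halves of a surrogate pair when needed.
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

Calling `XmlCharCheck.firstIllegalIndex(title)` on a title that ends in U+0003 would return the index of that last character.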
I didn't explore it further. The only thing I can add is that I encountered a similar issue in a non-Dataverse project. Somehow we had the https://www.fileformat.info/info/unicode/char/0002/index.htm character in our data, which also caused problems when creating XML files. What we did to fix it:
The algorithm that removes the unwanted characters was based on https://stackoverflow.com/questions/6198986/how-can-i-replace-non-printable-unicode-characters-in-java (answer https://stackoverflow.com/a/18603020), but instead of replacing them with a ? character we just remove them, and we left the newline and tab characters (\n, \t) intact.
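The approach described above (drop unwanted characters, keep newline and tab) could look roughly like the following Java sketch. It is not the code from that project: the class name `ControlCharStripper` is hypothetical, and the set of Unicode categories treated as unwanted (control, format, private use, surrogate, unassigned) is an assumption.

```java
public final class ControlCharStripper {
    private ControlCharStripper() {}

    /** Removes unwanted code points from the input, keeping '\n' and '\t'. */
    public static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            // Newline and tab are control characters too, so allow them explicitly.
            if (cp == '\n' || cp == '\t' || !isUnwanted(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    /** Assumed definition of "unwanted": categories that are not printable text. */
    private static boolean isUnwanted(int cp) {
        switch (Character.getType(cp)) {
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.PRIVATE_USE:
            case Character.SURROGATE:
            case Character.UNASSIGNED:
                return true;
            default:
                return false;
        }
    }
}
```

Applied to a title ending in U+0003, `ControlCharStripper.strip(title)` returns the title without that trailing character. Note that carriage return (\r) is also removed under these assumptions.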
Hi, I found that the OAI-PMH API returns erroneous XML data for the URL https://dataverse.harvard.edu/oai?verb=ListRecords&resumptionToken=MTo1NTAwfDI6fDM6fDQ6fDU6b2FpX2Rj (see the figure below).