IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34

Special characters in dataset metadata prevent Dataverse repositories from harvesting from Harvard Dataverse #72

Open jggautier opened 4 years ago

jggautier commented 4 years ago

Specifically, UNC Dataverse and Demo Dataverse are unable to harvest all records in Harvard Dataverse's "IQSS" set, and Demo Dataverse is unable to harvest all records in Harvard Dataverse's default set. Both the IQSS set and the default set contain metadata records of all datasets deposited in Harvard Dataverse.

UNC's repository (Dataverse version 4.16) has harvested about 20,000 of Harvard Dataverse's 32,000+ datasets from the IQSS set using the oai_ddi metadata format. Don Sizemore let me know that the repository's superuser dashboard reports that the last harvesting attempt was Apr 29, 2018 and that it's been "INPROGRESS" since.

Demo Dataverse (version 4.20) harvested fewer than 14,000 of Harvard Dataverse's datasets from the IQSS set using the dataverse_json format. On May 11, 2020, its superuser dashboard reported that that day's harvesting attempt had FAILED. Then I set Demo Dataverse to harvest from the default set, also using the dataverse_json format, and that attempt failed too, again after harvesting fewer than 14,000 records.

jggautier commented 4 years ago

This might be related to invisible characters in the metadata of datasets. I was asked to try to find special characters that are breaking harvesting, and I've found that datasets with the hexadecimal character 0x0C in their description fields are breaking metadata export in general (in the UI, over the API, and over OAI-PMH) and making it impossible to edit the dataset metadata, at least in the UI: nothing happens when I click the dataset's Edit Metadata button.

You can find 35 datasets with this character in their description fields by running the following query against the database:

select datasetversion.dataset_id, concat('https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:', authority, '/', identifier) as dataset_url, 
datasetversion.versionstate, versionnumber, lastupdatetime, datasetfieldvalue.value, dataverse.name
from datasetfield datasetfield1, datasetfield datasetfield2, datasetfieldcompoundvalue, datasetfieldvalue, datasetversion, datasetfieldtype, dvobject, dataset, dataverse
where
-- Join all of the tables
    datasetfield1.id = datasetfieldvalue.datasetfield_id and
    datasetfield1.parentdatasetfieldcompoundvalue_id = datasetfieldcompoundvalue.id and
    datasetfield2.id = datasetfieldcompoundvalue.parentdatasetfield_id and
    datasetfield2.datasetversion_id = datasetversion.id and
    datasetfield2.datasetfieldtype_id = datasetfieldtype.id and
    dvobject.id = datasetversion.dataset_id and
    dataset.id = dvobject.id and
    dataverse.id = dvobject.owner_id and
-- Search in the description field for the special character
    datasetfield1.datasetfieldtype_id = 17 and
    datasetfieldvalue.value ~ E'\x0C' and -- E'\x0C' is the form feed character
-- Exclude harvested datasets
    harvestingclient_id is null and
-- Exclude any metadata stored in metadata templates
    datasetfield1.template_id is null
order by lastupdatetime desc
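
If the id of the description field is different in another installation, it can be looked up first - a sketch, assuming the field is named dsDescriptionValue as in the standard citation metadata block:

select id, name from datasetfieldtype where name = 'dsDescriptionValue';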

The where clause uses PostgreSQL's regex match operator to search for the form feed character; E'\x0C' is an escape string for that single hexadecimal character. If you paste the character itself into a text editor that shows hidden characters, you should see <0x0C>. It's a control character, so people probably copied text from a webpage or another application and pasted it into the metadata field. I can't remember or find a better way to query for these types of characters, to get a better idea of how many datasets this is affecting - that is, how many have un-editable and un-exportable metadata. If someone else has suggestions, please share. :)
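
The closest thing I could come up with is matching the whole range of control characters instead of just 0x0C - a sketch only, assuming PostgreSQL, and leaving out tab, newline, and carriage return since those are expected in text:

-- count metadata values containing any other control character
select count(*)
from datasetfieldvalue
where value ~ E'[\x01-\x08\x0B\x0C\x0E-\x1F]';

That only counts field values, so it would still need the joins from the query above to get dataset URLs and to exclude harvested datasets and templates.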

These three datasets, published earlier this year, are among the 35:

I can reproduce this problem in Demo Dataverse, so this issue should probably be moved to the general Dataverse GitHub repo, or maybe information in this issue should be moved into an existing issue in the general Dataverse GitHub repo. This doesn't seem to affect dataset publishing, so it doesn't seem to fit the scope of https://github.com/IQSS/dataverse/issues/3328.

JingMa87 commented 4 years ago

@jggautier I harvested the IQSS set from the Harvard repo too, using dataverse_json on version 4.20, and the run succeeded. I'm investigating the 2,668 failures, but I'm wondering what's different about Demo Dataverse compared to my local version. Do you know?


FAILED: Do you have a log message from the Demo Dataverse run that I could look into? In the source code, the state is only changed to FAILED if an error occurred.

INPROGRESS: The INPROGRESS state that UNC saw happens on my local machine when Dataverse crashes while I'm harvesting, or if someone reboots the (web) server during a harvest. I can reproduce the same behavior when I'm deleting. I think this is a very important one to fix, but it's also a larger issue.


jggautier commented 4 years ago

Thanks for investigating! =)

I don't know what's different about demo dataverse compared to your local Dataverse installation, or at least I can't imagine what difference might help explain why your local installation is able to harvest more datasets than demo dataverse is.

Earlier today I told demo dataverse to harvest Harvard Dataverse's IQSS set again. It's at about 12,000 so far and chugging along. I'll check tomorrow to see how much it gets. I don't know how to get a log message from demo dataverse. @djbrooke, would a developer be available to get @JingMa87 this info?

I wrote in an earlier comment that I think special characters in the metadata of some datasets are causing at least some of these errors. I think that because when you view one of Harvard Dataverse's smaller sets, https://dataverse.harvard.edu/oai?verb=ListRecords&set=Princeton_Authored_Datasets&metadataPrefix=oai_datacite (or any metadataPrefix, like oai_dc), my Firefox browser reports the first error, which I think involves a dataset with metadata that contains the hexadecimal character I mentioned.

I'd be interested to know if the 35 datasets I found with that hexadecimal character are among the 2,668 datasets that your local installation couldn't harvest. If you think that'll be helpful, I can send the DOIs to you (or you could send the 2,668 DOIs to me).

I realize that this issue is also larger than harvesting, since as I've mentioned the special character is preventing the metadata of these problem datasets from being exported in any way: through the UI, with Dataverse's APIs, or over OAI-PMH. Also, these datasets' metadata can't be edited through the UI - people aren't able to update their metadata. So I wonder if this GitHub issue should be reframed so that it gets prioritized a little higher. (More of a question for @djbrooke and @scolapasta).

JingMa87 commented 4 years ago

@jggautier

Special characters: When I harvest the Princeton_Authored_Datasets set with the oai_dc prefix, I get 266 correct harvests and 1 failure. The failure is a record whose XML contains a special character: https://dataverse.harvard.edu/oai?verb=GetRecord&metadataPrefix=oai_ddi&identifier=doi:10.7910/DVN/9IYGIX. The most important thing is that the run as a whole succeeds, so this special-character bug doesn't cause issues for the other datasets. I also harvested the same set using dataverse_json; the special character isn't in the JSON, so that record is processed correctly. I can see if I can fix the special-character bug - do you have a list of the characters?


2668 IQSS set failures: Like I mentioned before, I harvested the whole IQSS set using dataverse_json and noticed that this format leads to many failures. The problem is that there are many custom keys that are apparently not allowed. In the following JSON, the typeName "ARCS1" is not allowed: https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.7910/DVN/26258. None of the failures had to do with special characters. As far as I understand, the industry-wide standard is Dublin Core, so is this a big issue since it's specific to dataverse_json? I could also look into this one if it's important.

Dublin Core: Besides harvesting IQSS using dataverse_json, I also tested oai_dc. Below you can find the success message. I checked out the failures and there's nothing strange going on. Almost all of them are references to IDs that don't exist (anymore): https://dataverse.harvard.edu/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=doi:10.7910/DVN/J6VWVG. I've seen this error message a lot. Basically the ID gets returned when you do a ListIdentifiers call, but when you do a GetRecord call the ID somehow doesn't exist.

[screenshot of the success message]

jggautier commented 4 years ago

Ah, I see. When testing harvesting from one Dataverse repository to another, I usually use the prefix dataverse_json since it has the least metadata loss. So I tried harvesting the Princeton_Authored_Datasets set again using dataverse_json, and Demo Dataverse failed to harvest all records. But we'd also like non-Dataverse based repositories to harvest this set, so I tested harvesting using oai_dc (harvested 81 records/186 failed) and oai_datacite (harvested 0 records/all failed). Since I can't tell what differences between Demo Dataverse and the local Dataverse installation you're using might be causing our different testing results, I'll ask @djbrooke to see if a developer who can get more info about Demo Dataverse can help. Maybe my using Demo Dataverse isn't the right approach anyway, since it can't really be used to predict how successfully Dataverse repositories can distribute metadata records to other systems, which we have no control over.

> 2668 IQSS set failures: Like I mentioned before, I harvested the whole IQSS set using dataverse_json and noticed that this format leads to many failures. The problem is that there are many custom keys that are apparently not allowed. In the following JSON, the typeName "ARCS1" is not allowed: https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.7910/DVN/26258.

Trying to make sure I understand this :) Since the format is JSON, and OAI-PMH requires XML, the XML includes an API call, which the harvesting Dataverse repository then uses to try to get the JSON metadata, right? And the problem is that the harvesting Dataverse, e.g. your local testing Dataverse, doesn't like these custom keys in the JSON? "ARCS1" is a metadata field from a metadata block that's only enabled in Harvard Dataverse, so I wouldn't expect other Dataverse installations to know what to do with it. But instead of ignoring it, the harvest fails. Is that right? I can try to test this, too, though I'm becoming less confident in how much help I can provide.

Since we promote using dataverse_json when Dataverse repositories harvest from each other (to reduce metadata loss), this sounds like a big problem for harvesting between Dataverse repositories, but not immediately more urgent than getting harvesting to work when using the Dublin Core and DataCite standards (my immediate need being to help the library from Princeton harvest the Princeton_Authored_Datasets set into their non-Dataverse system).

Thanks again for helping troubleshoot. I know you're working on other Dataverse issues and I hope this has been somewhat helpful to you :)

JingMa87 commented 4 years ago

Since I'm running the newest version of Dataverse, I think that other installations will harvest correctly when they update to this version.

> Trying to make sure I understand this :) Since the format is JSON, and OAI-PMH requires XML, the XML includes an API call, which the harvesting Dataverse repository then uses to try to get the JSON metadata, right? And the problem is that the harvesting Dataverse, e.g. your local testing Dataverse, doesn't like these custom keys in the JSON? "ARCS1" is a metadata field from a metadata block that's only enabled in Harvard Dataverse, so I wouldn't expect other Dataverse installations to know what to do with it. But instead of ignoring it, the harvest fails. Is that right? I can try to test this, too, though I'm becoming less confident in how much help I can provide.

This is 100% correct and there's a good chance I can fix this one so I'll look into it. Is there a GitHub issue for this on the main dataverse project? Otherwise I can make one.

JingMa87 commented 4 years ago

@jggautier I made an issue and fix for the unknown types in the dataverse_json: https://github.com/IQSS/dataverse/issues/7056

jggautier commented 4 years ago

Thanks so much @JingMa87!

JingMa87 commented 4 years ago

@jggautier Special characters: I looked into this issue, but the OAI-PMH response I'm getting from the Harvard repo with a GetRecord call doesn't give me useful info. Basically the XML response is incomplete and missing data. To find out the specific cause I'd have to look into the database, but I don't have access, so a developer who does needs to figure this one out. I do want to suggest two things:

  1. To filter out special characters from the text of new datasets
  2. To filter out special characters when harvesting from other repos

Something like a simple replace function might be enough: https://stackoverflow.com/questions/6198986/how-can-i-replace-non-printable-unicode-characters-in-java
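
And for the datasets that already have the character, a developer with database access could probably clean up the stored values directly - just a sketch, assuming PostgreSQL, that only the control characters need removing, and that the metadata would be re-exported afterwards:

-- strip control characters (other than tab, newline, and carriage return) from affected metadata values
update datasetfieldvalue
set value = regexp_replace(value, E'[\x01-\x08\x0B\x0C\x0E-\x1F]', '', 'g')
where value ~ E'[\x01-\x08\x0B\x0C\x0E-\x1F]';

I'd test that on a copy of the database first, of course.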

jggautier commented 4 years ago

Thanks for following up on this, and for discovering and working on https://github.com/IQSS/dataverse/issues/7056, which seems like part of the cause of Dataverse repos not being able to harvest from Harvard Dataverse (this broad issue).

I think it'll be helpful to rename this GitHub issue to be specifically about failures caused by special characters in the metadata, so I'll do that.

jggautier commented 4 years ago

Related to https://github.com/IQSS/dataverse.harvard.edu/issues/39

sbarbosadataverse commented 1 year ago

What's the likelihood this issue will be fixed with the Harvesting updates in progress? @mreekie @siacus We don't want to add this to the Dataverse Backlog for Harvard Dataverse if it may get fixed by the harvesting updates.

Thanks