IQSS / dataverse

Open source research data repository software
http://dataverse.org

Harvesting zenodo client fail #5050

Closed stevenferey closed 4 years ago

stevenferey commented 6 years ago

When I try to add Zenodo (https://zenodo.org/oai2d) as a harvesting client on Dataverse 4.8.6, 4.9.1, and 4.9.2, I get the error:

"Failed to find a global identifier in the OAI_DC XML record"

pdurbin commented 6 years ago

@stevenferey thanks for creating this issue.

Any developer who picks this up should check out the thread on the dataverse-users list as well: https://groups.google.com/d/msg/dataverse-community/Y2QUrZR0c6s/HewhzprODwAJ

stevenferey commented 6 years ago

Hi Philip,

Sorry, I'm coming from that group: my message from 10 September at 16:31. You then advised me to create a GitHub issue!

pdurbin commented 6 years ago

@stevenferey yes! Thanks for creating this GitHub issue! It's much appreciated. We estimate individual GitHub issues during sprint planning and backlog grooming meetings, so it's great to have an issue to hang an estimate on. You can't see the estimates here, but they are visible as a "size" such as 1, 2, 3, 5, etc. on our kanban board at https://waffle.io/IQSS/dataverse

stevenferey commented 6 years ago

Thank you for your answer Philip,

to be more precise, here is the complete error stack trace:

pdurbin commented 6 years ago

@stevenferey thanks and for 4.9.2, here's where that "Failed to find a global identifier in the OAI_DC XML record" error is thrown: https://github.com/IQSS/dataverse/blob/v4.9.2/src/main/java/edu/harvard/iq/dataverse/api/imports/ImportGenericServiceBean.java#L229

JingMa87 commented 4 years ago

I found out what the issue is. In the OAI-PMH response, the first identifier tag <dc:identifier> has to be a persistent identifier URL: a URL that will always work and redirects to the original source. A modern problem in the academic world is that URL references in citations become outdated within a matter of years; this is called "link rot". Persistent identifiers solve this issue. There are two websites where Dataverse allows you to make a persistent identifier:

The <dc:identifier> tag in the OAI-PMH response should then look like <dc:identifier>https://hdl.handle.net/10411/DHBGAE</dc:identifier>, not like <dc:identifier>https://zenodo.org/record/16445</dc:identifier>.

Valid response: https://dataverse.nl/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=doi:10.34894/0QEEHD

Invalid response: https://zenodo.org/oai2d?verb=GetRecord&identifier=oai:zenodo.org:16445&metadataPrefix=oai_dc
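For illustration, the check Dataverse is effectively applying here can be sketched like this (a simplified sketch, not the actual ImportGenericServiceBean code; the class and method names are made up, and the exact prefix list is an assumption based on this thread):

```java
public class GlobalIdCheck {

    // Prefixes treated as persistent identifiers in this sketch
    // (assumed; the doi:/hdl: shorthand forms come up later in the thread).
    private static final String[] PID_PREFIXES = {
        "https://doi.org/", "http://doi.org/", "doi:",
        "https://hdl.handle.net/", "http://hdl.handle.net/", "hdl:"
    };

    // Does a dc:identifier value look like a global (persistent) identifier?
    static boolean looksLikeGlobalId(String identifier) {
        for (String prefix : PID_PREFIXES) {
            if (identifier.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A handle URL like the one DataverseNL publishes passes the check:
        System.out.println(looksLikeGlobalId("https://hdl.handle.net/10411/DHBGAE"));
        // Zenodo's record URL does not, which triggers the
        // "Failed to find a global identifier" error:
        System.out.println(looksLikeGlobalId("https://zenodo.org/record/16445"));
    }
}
```

Since only the first <dc:identifier> is inspected, a record whose first identifier fails this kind of test is rejected even if a valid PID appears later in the record.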

@stevenferey There are two options. Either you contact Zenodo and ask them to use one of the supported persistent identifiers, or (@pdurbin) we'll have to build in support for non-persistent identifiers, but that's not a decision I can make. It would depend on the product owner or someone comparable.

pdurbin commented 4 years ago

@JingMa87 thanks for the write up.

@stevenferey if you're willing to contact Zenodo, that's probably the next step. Zenodo definitely supports DOIs, but I'm not sure why they apparently aren't coming through over OAI-PMH.

JingMa87 commented 4 years ago

@pdurbin Since this issue is from 2018, I don't think the original poster will still answer. I've asked our functional manager to report the issue to Zenodo. Ironically, at least one of our own repos has the same issue. I suggest closing this GitHub issue; agreed?

pdurbin commented 4 years ago

@JingMa87 well, how's the error handling on the Dataverse side? Could it be improved? Is there a good message in the Dataverse GUI explaining what's wrong?

JingMa87 commented 4 years ago

@pdurbin The message about the harvesting run is very limited right now, but I can imagine that a pop-up window with more info on the failures would be nice. Let's say you hover over the word "failed" and a window pops up; when you move the cursor away, it disappears again. Is this also what you have in mind?

[screenshot]

pdurbin commented 4 years ago

@JingMa87 yes, something like that but it would be good to get input from people who set up harvesting regularly such as @jggautier who is also a design meeting regular.

jggautier commented 4 years ago

Oh interesting. It might be helpful to know what the failure messages would look like. Would the message for the failure in this GitHub issue be "Failed to find a global identifier in the OAI_DC XML record"? Would all of the messages be that brief?

Could there be a message with more details when the harvest fails and only the word "FAILED" is shown?:

[screenshot from 2020-07-01]

As far as I can tell, the only two message formats where a failure is indicated are:

Any more details installation admins can access through the UI would be better than none, I think, but the interaction of hovering over the word "failed", as it appears now, to get a pop-up with more info might not be that helpful. I wouldn't know to hover over the word "failed".

Other parts of the application use a question mark icon, and hovering over that gets you more details:

[screenshot from 2020-07-01]

Or could the word "failed" be made to look like a link, so the user thinks to move their cursor to it, which would make a tooltip appear with more info? I'm not sure how accessible this option would be. (That's in @mheppler's wheelhouse.)

JingMa87 commented 4 years ago

@jggautier I would rather have a failure message like "The global identifier should start with https://hdl.handle.net/ or https://doi.org/", but yes the idea is that the messages are very succinct.

I think a question mark after the word "FAILED" makes a lot of sense. And another one after the "SUCCESS; # harvested, # deleted, # failed." with more info about the failures. I'll leave it to your design people to figure this out.

JingMa87 commented 4 years ago

@pdurbin I talked to our functional manager and she pointed out that the Dublin Core standard doesn't restrict the dc:identifier element to only DOIs and handles. The XSD has no such restriction (https://www.dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd), and the website of the Dublin Core Metadata Initiative explicitly names ISBN and URN as options (https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#http://purl.org/dc/terms/identifier). How do you feel about broadening the allowed dc identifiers?

pdurbin commented 4 years ago

@JingMa87 that's probably a question for @jggautier but are we still talking about a failure to harvest from Zenodo? Please feel free to create a new issue that captures the problem you're having, your use cases. 😄

jggautier commented 4 years ago

How do you feel about broadening the allowed dc identifiers?

Hmmm, I'm a little confused about the problem now. I get that it's not ideal that the PIDs of datasets from Zenodo aren't being used in the Dublin Core metadata they publish over OAI-PMH and that it's technically allowed to put anything as an identifier.

But right now Dataverse is refusing to harvest records in Dublin Core when the identifier isn't a handle or a DOI?

JingMa87 commented 4 years ago

@pdurbin It's definitely related to this issue, since Dataverse can't harvest from Zenodo because of the <dc:identifier>https://zenodo.org/record/16445</dc:identifier>.

@jggautier You're correct, Dataverse doesn't allow you to harvest a PID that's not a handle or DOI. But is this really what we want? Dublin Core allows ISBN and URN too.

jggautier commented 4 years ago

I would say this is too restrictive. Is there a way to tell why the restriction was put into place, or was it unintentionally coded that way? I found a few old notes when harvesting support was being planned, but nothing about a restriction on the type of identifiers allowed. @pdurbin would you know?

Was it meant to ensure that when Dataverses harvest, they display citations that always include a persistent ID?:

[screenshot from 2020-07-23]

If that's the case, then it sounds like solving this problem means asking Zenodo to change what they put in one of their records' dc:identifier elements. It looks like each record has a couple of identifier elements, like the record at https://zenodo.org/oai2d?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:zenodo.org:3516046. One of that record's identifiers is "10.5281/zenodo.3516046", which looks like the DOI, or part of one but it's missing the protocol and isn't a URL? So would we also want to make sure the identifier is also a URL?

There are other types of persistent IDs I imagine we wouldn't want to restrict, like ARKs. And we'd want to document this restriction in the guides.

JingMa87 commented 4 years ago

@jggautier In the code I couldn't find any documentation on why this restriction exists. By the way, Dataverse also allows a dc:identifier value of "doi:10.7939/DVN/07XBZD", and it only checks the first occurrence of the dc:identifier element, so if the correct URL is in the second occurrence the harvest of the dataset also fails.

jggautier commented 4 years ago

Ah, that makes sense. Thanks. I should have noticed that the identifiers in Dataverse's own records follow the doi:#### format, too.

I looked but couldn't find any best practices around handling identifiers when harvesting records in Dublin Core. I probably don't know where to look. But I'd like to learn whether the restriction was intended and why, so I'll stop fruitlessly digging for now.

JingMa87 commented 4 years ago

@jggautier So does that mean we can't reach a conclusion on this problem?

jggautier commented 4 years ago

I'm not able to without more info, although resolving it would be great. Could we wait for next week to hear back from others?

JingMa87 commented 4 years ago

@jggautier Of course! Let me know what the outcome will be.

pdurbin commented 4 years ago

Was it meant to ensure that when Dataverses harvest, they display citations that always include a persistent ID?:

I recently noticed the following comment in the citation code at https://github.com/IQSS/dataverse/blob/v4.20/src/main/java/edu/harvard/iq/dataverse/DataCitation.java#L78

    // The Global Identifier: 
    // It is always part of the citation for the local datasets; 
    // And for *some* harvested datasets. 
    persistentId = getPIDFrom(dsv, dsv.getDataset());

So, no, harvested datasets do not always show a citation in Dataverse.

It would be nice to be able to harvest from Zenodo. I'm not sure which side has to change.

jggautier commented 4 years ago

Thanks! Hmmm, that's one question down! Just noticed also that Harvard Dataverse shows citations for this metadata harvested from DataCite, but the citation doesn't include an identifier.

I still can't find anything out as far as best practices for how dc:identifier should be used in metadata published over OAI-PMH. As far as common practices:

It seems to me that if we want to support harvesting metadata from non-Dataverse repositories, including Zenodo, either more flexibility is needed or we need to ask and wait for those repositories to follow the same rules (e.g. if there's more than one dc:identifier, the first needs to follow one of the two identifier formats). If no one knows why that restriction was put in place, then removing the restriction seems to make sense now (and we might learn why the restriction was put in place if something unexpected happens).

JingMa87 commented 4 years ago

@jggautier I'd also like to add a repo from the organization I'm working with, which is why this problem is important for us:

Adding a fix where you check all dc:identifiers for a handle or DOI is super easy to make. Broadening the allowed dc:identifiers is also quite easy fyi.

jggautier commented 4 years ago

Thanks @JingMa87

@pdurbin wrote:

So, no, harvested datasets do not always show a citation in Dataverse.

The only DC metadata that Harvard Dataverse has been able to harvest is from DataCite (in the "SRDA Dataverse"), and Dataverse didn't include the DOI in the blue box in the records' search cards:

[screenshot from 2020-08-18]

Does that mean that Dataverse doesn't really try to construct a citation from elements in harvested DC metadata? I'm not able to get more examples of this by trying to harvest DC metadata from other repositories (harvesting in Demo Dataverse isn't working for me as of this writing). But if this is really the case, then the only reason I could think of for requiring that the first dc:identifier be a DOI (so that the DOI is always included in the blue box) was wrong and doesn't make sense.

So why is Dataverse even looking for a dc:identifier element when trying to harvest DC records? In the DC elements schema, all DC elements are optional and OAI-PMH doesn't require that DC records have dc:identifier elements. If no one knows or if there is no reason, I think the options are to:

JingMa87 commented 4 years ago

@jggautier Thanks for your answer. I found metadata that adds the dc identifier to the blue box. It's the https://dataverse.nl/oai server with the MaastrichtUniversity set using oai_dc.

[screenshot]

Removing the restriction altogether does mean that the code would have to undergo a fairly big change. We can probably solve a lot of the problems by just finding an hdl or doi among all the dc:identifier elements instead of using only the first one (which is how it currently works). This change would have only minor impacts on the codebase.

jggautier commented 4 years ago

Ah, okay. Thanks for finding that example. So when Dataverse is able to harvest DC metadata, sometimes it adds the dc:identifier to the blue box and sometimes it doesn't? I'm curious what "Archive Type" was chosen when that client was set up to harvest oai_dc metadata from DataverseNL. Do you think that makes a difference? For the harvesting client I set up to harvest from SRDA Dataverse, I chose "Generic OAI Archive":

[screenshot from 2020-08-18]

We can probably solve a lot of the problems by just finding a hdl or doi amongst all the dc identifier elements instead of using only the first dc identifier (which is how it currently works). This change would have minor impacts on the codebase.

So the change would be that instead of expecting the first dc:identifier to be a hdl or doi and failing to harvest if it isn't, Dataverse will look for the first dc:identifier that is a hdl or doi and would be less strict about the format. For example, Dataverse will accept https://doi.org/12345/ABCDE or doi:12345/ABCDE, and in the blue box would always display the URL form. Does that sound right?

What happens if none of the dc:identifier elements contain what Dataverse can identify as a doi or hdl? Then the harvest would fail and the dashboard would show an error message similar to what you described earlier (in an earlier comment)?

JingMa87 commented 4 years ago

So when Dataverse is able to harvest DC metadata, sometimes it adds the dc:identifier to the blue box and sometimes it doesn't?

Yes, so it seems. In my case it adds the dc:identifier to the blue box, and in your case it doesn't.

I'm curious what "Archive Type" was chosen when that client was set up to harvest oai_dc metadata from DataverseNL.

I used the "Dataverse v4+" archive type.

For the harvesting client I set up to harvest from SRDA Dataverse, I chose "Generic OAI Archive"

What server URL did you use for your test (I used https://dataverse.nl/oai)? I'm curious about the datasets.

So the change would be that instead of expecting the first dc:identifier to be a hdl or doi and failing to harvest if it isn't, Dataverse will look for the first dc:identifier that is a hdl or doi and would be less strict about the format. For example, Dataverse will accept https://doi.org/12345/ABCDE or doi:12345/ABCDE, and in the blue box would always display the URL form. Does that sound right?

Not exactly. Dataverse currently already accepts the formats doi:12345/ABCDE, https://doi.org/12345/ABCDE, hdl:12345/ABCDE, and https://hdl.handle.net/12345/ABCDE. I also wouldn't change anything about how the URL is displayed in the blue box. The only change I'd make is to check all dc:identifier elements for a doi or hdl instead of only the first element. The rest of the behaviour would stay exactly as it is now.
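The proposed change could be sketched roughly like this (a hypothetical helper, not the actual Dataverse code): scan all dc:identifier values and return the first one matching one of the four accepted formats, instead of inspecting only the first element.

```java
import java.util.List;
import java.util.Optional;

public class PidScanner {

    // The four formats the thread says Dataverse already accepts.
    private static final String[] ACCEPTED = {
        "doi:", "https://doi.org/", "hdl:", "https://hdl.handle.net/"
    };

    // Return the first dc:identifier value that is a recognized DOI or handle.
    static Optional<String> findGlobalId(List<String> dcIdentifiers) {
        for (String id : dcIdentifiers) {
            for (String prefix : ACCEPTED) {
                if (id.startsWith(prefix)) {
                    return Optional.of(id);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A Zenodo-style record where the PID is not the first identifier:
        List<String> ids = List.of(
                "https://zenodo.org/record/16445",
                "doi:10.5281/zenodo.16445");
        System.out.println(findGlobalId(ids).orElse("none"));
    }
}
```

With the current first-element-only behaviour this record would fail to harvest; scanning all the elements finds the doi: form in the second position.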

What happens if none of the dc:identifier elements contain what Dataverse can identify as a doi or hdl? Then the harvest would fail and the dashboard would show an error message similar to what you described earlier (in an earlier comment)?

If there's no doi or hdl, the harvest of that dataset will fail. Currently the dashboard will just show something like "SUCCESS; 156 harvested, 0 deleted, 3 failed." I think a message in the UI should be addressed in another GitHub issue and accompanying Pull Request.

jggautier commented 4 years ago

What server URL did you use for your test (I used https://dataverse.nl/oai)? I'm curious about the datasets.

To harvest SRDA metadata into Harvard Dataverse (https://dataverse.harvard.edu/dataverse/srda_harvested), we have to use DataCite's OAI-PMH feed (https://oai.datacite.org/oai - the harvesting set is GESIS.SRDA).

So it seems like one of the things Dataverse does when we choose the "Dataverse v4+" Archive Type is add the URL form of a DOI or HDL to the blue box in the search cards. When harvesting from non-Dataverse repositories like Zenodo and EASY, we'd use the Archive Type called "Generic OAI Archive", and the search cards for those harvested records would not include the DOI or HDL that Dataverse still needs to find in the oai_dc metadata.

I think a message in the UI should be addressed in another GitHub issue and accompanying Pull Request.

The Admin Guide's harvesting page has a "What if a run fails" section that tells people to look for a log in the "app server’s default logging directory". So I suppose people who need to know why some or all harvesting failed will know to look for that log or contact someone who knows how to find and interpret the info in it. (Would be helpful in the future to see what types of Dataverse users are setting up harvesting runs and the best place to put more information about failures.)

The only change I'd make is to check all dc identifier elements for a doi or hdl instead of only the first element.

This slightly broadened restriction is okay with me, since it seems that it will let Dataverse repositories harvest from more (but not all) non-Dataverse repositories, like Zenodo and EASY.

It won't let Dataverse harvest from repositories like Figshare, which as I wrote earlier puts "10.6084/m9.figshare.12725075.v1" in the dc:identifier element. I imagine it would be harder to figure out if a string like "10.6084/m9.figshare.12725075.v1" is a DOI or HDL, as opposed to a string that starts with http/s://doi, doi:, https://hdl.handle.net, or hdl:. This is part of why I wanted to know why this restriction was created in the first place.

And since you wrote that the large effort of removing the restriction entirely isn't justified, I'm okay with reconsidering that if there's ever a request to harvest records from other repositories like Figshare.

@JingMa87, would you be able to submit a PR?

JingMa87 commented 4 years ago

To harvest SRDA metadata into Harvard Dataverse (https://dataverse.harvard.edu/dataverse/srda_harvested), we have to use DataCite's OAI-PMH feed (https://oai.datacite.org/oai - the harvesting set is GESIS.SRDA).

I don't get any sets, see the screenshot below.

[screenshot]

The first time I tried, it only returned about 50 sets, and I received an error message saying the response took too long.

[screenshot]

Anyhow, I've seen the "10.6084/m9.figshare.12725075.v1" handle from the Figshare repo in some other repos too. In case you don't know whether it's an hdl or a DOI, the hdl proxy should always work.

[screenshot]

I did a few tests and in all my cases a DOI proxy also works.

So in any case we could resolve these handles with an hdl URL. I emailed doi.org to find out whether their proxy also always works, since we prefer DOI. How do you feel about also allowing handles without hdl or doi in them and constructing them as an hdl or doi?

jggautier commented 4 years ago

How do you feel about also allowing handles without hdl or doi in them and constructing them as a hdl or doi?

I had no idea that https://hdl.handle.net/10.6084/m9.figshare.12725075.v1 and https://doi.org/10.6084/m9.figshare.12725075.v1 would point to the same resource. I don't understand it, but that's pretty cool! Could you write more about what you mean when you write that "we could resolve these handles with a hdl URL"?

If Dataverse harvests a record with a dc:identifier that doesn't start with http/s://doi, doi:, https://hdl.handle.net, or hdl:, like 10.6084/m9.figshare.12725075.v1, you're saying that Dataverse could still know that the identifier is a DOI or HDL?

JingMa87 commented 4 years ago

Could you write more about what you mean when you write that "we could resolve these handles with a hdl URL"?

By this I just mean constructing "https://hdl.handle.net/10.6084/m9.figshare.12725075.v1" from "10.6084/m9.figshare.12725075.v1". You can either prepend https://hdl.handle.net/ or https://doi.org/ to the handle.

If Dataverse harvests a record with a dc:identifier that doesn't start with http/s://doi, doi:, https://hdl.handle.net, or hdl:, like 10.6084/m9.figshare.12725075.v1, you're saying that Dataverse could still know that the identifier is a DOI or HDL?

The point is that we don't know whether it's a DOI or HDL, but any handle should work with https://hdl.handle.net/ prepended to it. So I could just make a PR that:

  1. Checks all the dc:identifier elements for a handle starting with https://doi, doi:, https://hdl.handle.net, or hdl:
  2. If there's no such handle, takes the one without doi or hdl (like 10.6084/m9.figshare.12725075.v1) and prepends hdl: to it. Then Dataverse will construct the URL https://hdl.handle.net/10.6084/m9.figshare.12725075.v1 from it automatically.
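Those two steps could be sketched like this (hypothetical helper and class names, not the actual PR code; step 2 assumes any bare identifier that starts with "10." and contains "/" can be treated as a handle):

```java
import java.util.List;
import java.util.Optional;

public class HarvestedIdResolver {

    private static final String[] PID_PREFIXES = {
        "https://doi.org/", "http://doi.org/", "doi:",
        "https://hdl.handle.net/", "http://hdl.handle.net/", "hdl:"
    };

    // Step 1: prefer an identifier with a recognized DOI/handle prefix.
    // Step 2: otherwise fall back to a bare "10.xxx/yyy" string and
    // prepend "hdl:", so the handle proxy (https://hdl.handle.net/)
    // can resolve it.
    static Optional<String> resolve(List<String> dcIdentifiers) {
        for (String id : dcIdentifiers) {
            for (String prefix : PID_PREFIXES) {
                if (id.startsWith(prefix)) {
                    return Optional.of(id);
                }
            }
        }
        for (String id : dcIdentifiers) {
            if (id.startsWith("10.") && id.contains("/")) {
                return Optional.of("hdl:" + id);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Figshare-style record: only a bare identifier is published.
        System.out.println(
                resolve(List.of("10.6084/m9.figshare.12725075.v1")).orElse("none"));
    }
}
```

If neither step finds anything, the harvest of that dataset would fail, as before.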

JingMa87 commented 4 years ago

@jggautier I tested the fix locally and managed to successfully harvest the Figshare repo with the format oai_dc and the set portal_895.

[screenshot]

jggautier commented 4 years ago

Awesome!

In the PR I see if (otherId.startsWith("10.") && otherId.contains("/")), so that's how Dataverse will guess that a dc:identifier is a hdl or doi when it doesn't start with http/s://doi, doi:, https://hdl.handle.net, or hdl:.

Looks like the PR covers all of the non-Dataverse repositories we listed earlier, including Zenodo. Would you say it's ready for code review?

JingMa87 commented 4 years ago

@jggautier Totally ready!

janvanmansum commented 4 years ago

Hi, interesting discussion.

JingMa87 commented 4 years ago

@janvanmansum Are you aware of any identifiers that start with 10., have a / in them, and are neither a DOI nor a handle? I've checked multiple repos (EASY, Zenodo, Figshare, Dataverse) and didn't find any.

pdurbin commented 4 years ago

@JingMa87 that might be a good question for https://www.pidforum.org 😄

JingMa87 commented 4 years ago

@pdurbin There it is: https://www.pidforum.org/t/how-do-i-determine-whether-a-string-is-a-valid-handle/1122

JingMa87 commented 4 years ago

@pdurbin I found a way to programmatically resolve a handle and will update the PR.