Closed stevenferey closed 4 years ago
@stevenferey thanks for creating this issue.
Any developer who picks this up should check out the thread on the dataverse-users list as well: https://groups.google.com/d/msg/dataverse-community/Y2QUrZR0c6s/HewhzprODwAJ
Hi Philip,
Sorry, I'm coming from this group: see my message from September 10th at 16:31. You then advised me to create a GitHub issue!
@stevenferey yes! Thanks for creating this GitHub issue! It's much appreciated. We estimate individual GitHub issues during sprint planning and backlog grooming meetings so it's great to hang an estimate on. You can't see the estimates here but they are visible as a "size" such as 1, 2, 3, 5, etc on our kanban board at https://waffle.io/IQSS/dataverse
Thank you for your answer Philip,
@stevenferey thanks and for 4.9.2, here's where that "Failed to find a global identifier in the OAI_DC XML record" error is thrown: https://github.com/IQSS/dataverse/blob/v4.9.2/src/main/java/edu/harvard/iq/dataverse/api/imports/ImportGenericServiceBean.java#L229
I found out what the issue is. In the OAI-PMH response, the first identifier tag <dc:identifier> has to be a persistent identifier URL: a URL that will always work and redirects to the original source. A common problem in the academic world is that URL references in citations become outdated within a matter of years; this is called "link rot". Persistent identifiers solve this problem. There are two websites where Dataverse allows you to make a persistent identifier:
The <dc:identifier> tag in the OAI-PMH response should then be something like <dc:identifier>https://hdl.handle.net/10411/DHBGAE</dc:identifier> and not something like <dc:identifier>https://zenodo.org/record/16445</dc:identifier>.
Valid response
https://dataverse.nl/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=doi:10.34894/0QEEHD
Invalid response
https://zenodo.org/oai2d?verb=GetRecord&identifier=oai:zenodo.org:16445&metadataPrefix=oai_dc
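The check described above can be sketched as follows. This is an illustrative reconstruction, not the actual ImportGenericServiceBean code; the class and method names are hypothetical. It pulls the first <dc:identifier> out of an OAI_DC record, which is the value the harvester inspects before throwing the "Failed to find a global identifier in the OAI_DC XML record" error:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class FirstDcIdentifier {

    // Hypothetical sketch: extract the first <dc:identifier> from an OAI_DC
    // record using namespace-aware DOM parsing. Not the actual Dataverse code.
    public static String firstIdentifier(String oaiDcXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(oaiDcXml.getBytes(StandardCharsets.UTF_8)));
        // dc:identifier lives in the Dublin Core elements namespace.
        NodeList ids = doc.getElementsByTagNameNS(
                "http://purl.org/dc/elements/1.1/", "identifier");
        return ids.getLength() > 0 ? ids.item(0).getTextContent().trim() : null;
    }

    public static void main(String[] args) throws Exception {
        String record = "<oai_dc:dc"
                + " xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\""
                + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<dc:identifier>https://hdl.handle.net/10411/DHBGAE</dc:identifier>"
                + "</oai_dc:dc>";
        System.out.println(firstIdentifier(record)); // https://hdl.handle.net/10411/DHBGAE
    }
}
```

If the first identifier in the Zenodo record above is fed to this, it would return https://zenodo.org/record/16445, which is neither a handle nor a DOI URL, hence the failure.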
@stevenferey There are two options: either you contact Zenodo and ask them to use one of the supported persistent identifiers, or (@pdurbin) we'll have to build in support for non-persistent identifiers. That's not a decision I can make, though; it would depend on the product owner or someone comparable.
@JingMa87 thanks for the write up.
@stevenferey if you're willing to contact Zenodo, that's probably the next step. Zenodo definitely supports DOIs but I'm not sure why they aren't apparently coming through over OAI-PMH.
@pdurbin Since this issue is from 2018, I don't think the original poster will still answer. I asked our functional manager to report the issue to Zenodo. Ironically, at least one of our own repos has the same issue. I suggest we close this GitHub issue, agreed?
@JingMa87 well, how's the error handling on the Dataverse side? Could it be improved? Is there a good message in the Dataverse GUI explaining what's wrong?
@pdurbin The message about the harvesting run is very limited right now, but I can imagine that a pop-up window with more info on the failures would be nice. Let's say you hover over the word "failed" and then a window pops up. When you unhover, it disappears again. Is this also what you have in mind?
@JingMa87 yes, something like that but it would be good to get input from people who set up harvesting regularly such as @jggautier who is also a design meeting regular.
Oh interesting. It might be helpful to know what the failure messages would look like. Would the message for the failure in this GitHub issue be "Failed to find a global identifier in the OAI_DC XML record"? Would all of the messages be that brief?
Could there be a message with more details when the harvest fails and only the word "FAILED" is shown?:
As far as I can tell, the only two message formats where a failure is indicated are:
Any more details installation admins can access through the UI would be better than none I think, but the interaction of hovering over the word "failed", as it appears now, to get a pop up with more info might not be that helpful. I wouldn't know to hover over the word failed.
Other parts of the application use a question mark icon, and hovering over that gets you more details:
Or could the word "failed" be made to look like a link, so the user thinks to move their cursor to it, which would make a tooltip appear with more info? Not sure how accessible this option would be. (That's in @mheppler's wheelhouse)
@jggautier I would rather have a failure message like "The global identifier should start with https://hdl.handle.net/ or https://doi.org/", but yes the idea is that the messages are very succinct.
I think a question mark after the word "FAILED" makes a lot of sense. And another one after the "SUCCESS; # harvested, # deleted, # failed." with more info about the failures. I'll leave it to your design people to figure this out.
@pdurbin I talked to our functional manager and she pointed to the fact that the Dublin Core standard doesn't restrict the dc:identifier element to only DOIs and handles. The XSD has no such restriction (https://www.dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd) and the website of the Dublin Core Metadata Initiative explicitly names ISBN and URN as options (https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#http://purl.org/dc/terms/identifier). How do you feel about broadening the allowed dc identifiers?
@JingMa87 that's probably a question for @jggautier but are we still talking about a failure to harvest from Zenodo? Please feel free to create a new issue that captures the problem you're having, your use cases. 😄
How do you feel about broadening the allowed dc identifiers?
Hmmm, I'm a little confused about the problem now. I get that it's not ideal that the PIDs of datasets from Zenodo aren't being used in the Dublin Core metadata they publish over OAI-PMH and that it's technically allowed to put anything as an identifier.
But right now Dataverse is refusing to harvest records in Dublin Core when the identifier isn't a handle or a DOI?
@pdurbin It's definitely related to this issue, since Dataverse can't harvest from Zenodo because of the <dc:identifier>https://zenodo.org/record/16445</dc:identifier>.
@jggautier You're correct, Dataverse doesn't allow you to harvest a PID that's not a handle or DOI. But is this really what we want? Dublin Core allows ISBN and URN too.
I would say this is too restrictive. Is there a way to tell why the restriction was put into place, or was it unintentionally coded that way? I found a few old notes when harvesting support was being planned, but nothing about a restriction on the type of identifiers allowed. @pdurbin would you know?
Was it meant to ensure that when Dataverses harvest, they display citations that always include a persistent ID?:
If that's the case, then it sounds like solving this problem means asking Zenodo to change what they put in one of their records' dc:identifier elements. It looks like each record has a couple of identifier elements, like the record at https://zenodo.org/oai2d?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:zenodo.org:3516046. One of that record's identifiers is "10.5281/zenodo.3516046", which looks like a DOI, or part of one, but it's missing the protocol and isn't a URL. So would we also want to make sure the identifier is also a URL?
There are other types of persistent IDs I imagine we wouldn't want to restrict, like ARKs. And we'd want to document this restriction in the guides.
@jggautier In the code I couldn't find any documentation on why this restriction exists. Dataverse also allows a dc:identifier value of "doi:10.7939/DVN/07XBZD" btw and only checks the first occurrence of the dc:identifier element, so if the correct URL is in the second occurrence the harvest of the dataset also fails.
Ah, that makes sense. Thanks. I should have noticed that the identifiers in Dataverse's own records follow the doi:#### format, too.
I looked but couldn't find any best practices around handling identifiers when harvesting records in Dublin Core. I probably don't know where to look. But I'd like to learn whether the restriction was intended and why, so I'll stop fruitlessly digging for now.
@jggautier So does that mean we can't reach a conclusion on this problem?
I'm not able to without more info, although resolving it would be great. Could we wait for next week to hear back from others?
@jggautier Of course! Let me know what the outcome will be.
Was it meant to ensure that when Dataverses harvest, they display citations that always include a persistent ID?:
I recently noticed the following comment in the citation code at https://github.com/IQSS/dataverse/blob/v4.20/src/main/java/edu/harvard/iq/dataverse/DataCitation.java#L78
// The Global Identifier:
// It is always part of the citation for the local datasets;
// And for *some* harvested datasets.
persistentId = getPIDFrom(dsv, dsv.getDataset());
So, no, harvested datasets do not always show a citation in Dataverse.
It would be nice to be able to harvest from Zenodo. I'm not sure which side has to change.
Thanks! Hmmm, that's one question down! Just noticed also that Harvard Dataverse shows citations for this metadata harvested from DataCite, but the citation doesn't include an identifier.
I still can't find anything out as far as best practices for how dc:identifier should be used in metadata published over OAI-PMH. As far as common practices:
https://doi.org/### formats: https://api.figshare.com/v2/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:figshare.com:article/12725075
It seems to me that if we want to support harvesting metadata from non-Dataverse repositories, including Zenodo, either more flexibility is needed or we need to ask and wait for those repositories to follow the same rules (e.g. if there's more than one dc:identifier, the first needs to follow one of the two identifier formats). If no one knows why that restriction was put in place, then removing the restriction seems to make sense now (and we might learn why the restriction was put in place if something unexpected happens).
@jggautier I'd also like to add a repo from the organization I'm working with, which is why this problem is important for us:
Adding a fix where you check all dc:identifiers for a handle or DOI is super easy to make. Broadening the allowed dc:identifiers is also quite easy fyi.
Thanks @JingMa87
@pdurbin wrote:
So, no, harvested datasets do not always show a citation in Dataverse.
The only DC metadata that Harvard Dataverse has been able to harvest is from DataCite (in the "SRDA Dataverse"), and Dataverse didn't include the DOI in the blue box in the records' search cards:
Does that mean that Dataverse doesn't really try to construct a citation from elements in harvested DC metadata? I'm not able to get more examples of this by trying to harvest DC metadata from other repositories (harvesting in Demo Dataverse isn't working for me as of this writing). But if this is really the case, then the only reason I could think of for requiring that the first dc:identifier be a DOI - so that the DOI is always included in the blue box - was wrong and doesn't make sense.
So why is Dataverse even looking for a dc:identifier element when trying to harvest DC records? In the DC elements schema, all DC elements are optional and OAI-PMH doesn't require that DC records have dc:identifier elements. If no one knows or if there is no reason, I think the options are to:
@jggautier Thanks for your answer. I found metadata that adds the dc identifier to the blue box. It's the https://dataverse.nl/oai server with the MaastrichtUniversity set using oai_dc.
Removing the restriction altogether does mean that the code will have to undergo a fairly big change. We can probably solve a lot of the problems by just finding a hdl or doi amongst all the dc identifier elements instead of using only the first dc identifier (which is how it currently works). That change would have minor impacts on the codebase.
Ah, okay. Thanks for finding that example. So when Dataverse is able to harvest DC metadata, sometimes it adds the dc:identifier to the blue box and sometimes it doesn't? I'm curious what "Archive Type" was chosen when that client was set up to harvest oai_dc metadata from DataverseNL. Do you think that makes a difference? For the harvesting client I set up to harvest from SRDA Dataverse, I chose "Generic OAI Archive":
We can probably solve a lot of the problems by just finding a hdl or doi amongst all the dc identifier elements instead of using only the first dc identifier (which is how it currently works). This change would have minor impacts on the codebase.
So the change would be that instead of expecting the first dc:identifier to be a hdl or doi and failing to harvest if it isn't, Dataverse will look for the first dc:identifier that is a hdl or doi and would be less strict about the format. For example, Dataverse will accept https://doi.org/12345/ABCDE or doi:12345/ABCDE, and in the blue box would always display the URL form. Does that sound right?
What happens if none of the dc:identifier elements contain what Dataverse can identify as a doi or hdl? Then the harvest would fail and the dashboard would show an error message similar to what you described earlier (in an earlier comment)?
So when Dataverse is able to harvest DC metadata, sometimes it adds the dc:identifier to the blue box and sometimes it doesn't?
Yes so it seems. In my case it adds the dc identifier to the blue box and in your case it doesn't.
I'm curious what "Archive Type" was chosen when that client was set up to harvest oai_dc metadata from DataverseNL.
I used the "Dataverse v4+" archive type.
For the harvesting client I set up to harvest from SRDA Dataverse, I chose "Generic OAI Archive"
What server URL did you use for your test (I used https://dataverse.nl/oai)? I'm curious about the datasets.
So the change would be that instead of expecting the first dc:identifier to be a hdl or doi and failing to harvest if it isn't, Dataverse will look for the first dc:identifier that is a hdl or doi and would be less strict about the format. For example, Dataverse will accept https://doi.org/12345/ABCDE or doi:12345/ABCDE, and in the blue box would always display the URL form. Does that sound right?
Not exactly. Currently, Dataverse already allows the formats doi:12345/ABCDE, https://doi.org/12345/ABCDE, hdl:12345/ABCDE, https://hdl.handle.net/12345/ABCDE. I also wouldn't change anything about the display of the URL in the blue box. The only change I'd make is to check all dc identifier elements for a doi or hdl instead of only the first element. The rest of the behaviour would be exactly the same as it is now.
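The four formats listed above can be captured in a small check. This is a minimal sketch with a hypothetical class and method name, not the actual Dataverse method:

```java
public class PidFormats {

    // Returns true if the identifier uses one of the four formats the thread
    // says Dataverse currently accepts. Illustrative sketch only; the real
    // parsing in ImportGenericServiceBean may differ.
    public static boolean isRecognizedPersistentId(String id) {
        if (id == null) {
            return false;
        }
        String s = id.trim();
        return s.startsWith("doi:")
                || s.startsWith("https://doi.org/")
                || s.startsWith("hdl:")
                || s.startsWith("https://hdl.handle.net/");
    }

    public static void main(String[] args) {
        System.out.println(isRecognizedPersistentId("doi:12345/ABCDE"));                  // true
        System.out.println(isRecognizedPersistentId("https://zenodo.org/record/16445"));  // false
    }
}
```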
What happens if none of the dc:identifier elements contain what Dataverse can identify as a doi or hdl? Then the harvest would fail and the dashboard would show an error message similar to what you described earlier (in an earlier comment)?
If there's no doi or hdl, the harvest of that dataset will fail. Currently the dashboard will just show something like "SUCCESS; 156 harvested, 0 deleted, 3 failed." I think a message in the UI should be addressed in another GitHub issue and accompanying Pull Request.
What server URL did you use for your test (I used https://dataverse.nl/oai)? I'm curious about the datasets.
To harvest SRDA metadata into Harvard Dataverse (https://dataverse.harvard.edu/dataverse/srda_harvested), we have to use DataCite's OAI-PMH feed (https://oai.datacite.org/oai - the harvesting set is GESIS.SRDA).
So it seems like one of the things Dataverse does when we choose the "Dataverse v4+" Archive Type is add the URL form of a DOI or HDL to the blue box in the search cards. When harvesting from non-Dataverse repositories like Zenodo and EASY, we'd use the Archive Type called "Generic OAI Archive", and the search cards for those harvested records would not include the DOI or HDL that Dataverse still needs to find in the oai_dc metadata.
I think a message in the UI should be addressed in another GitHub issue and accompanying Pull Request.
The Admin Guide's harvesting page has a "What if a run fails" section that tells people to look for a log in the "app server’s default logging directory". So I suppose people who need to know why some or all harvesting failed will know to look for that log or contact someone who knows how to find and interpret the info in it. (Would be helpful in the future to see what types of Dataverse users are setting up harvesting runs and the best place to put more information about failures.)
The only change I'd make is to check all dc identifier elements for a doi or hdl instead of only the first element.
This slightly broadened restriction is okay with me, since it seems that it will let Dataverse repositories harvest from more (but not all) non-Dataverse repositories, like Zenodo and EASY.
It won't let Dataverse harvest from repositories like Figshare, which as I wrote earlier puts "10.6084/m9.figshare.12725075.v1" in the dc:identifier element. I imagine it would be harder to figure out whether a string like "10.6084/m9.figshare.12725075.v1" is a DOI or HDL, as opposed to a string that starts with http/s://doi, doi:, https://hdl.handle.net, or hdl:. This is part of why I wanted to know why this restriction was created in the first place.
And since you wrote that the large effort of removing the restriction entirely isn't justified, I'm okay with that maybe being reconsidered if there's ever a request to harvest records from other repositories like Figshare.
@JingMa87, would you be able to submit a PR?
To harvest SRDA metadata into Harvard Dataverse (https://dataverse.harvard.edu/dataverse/srda_harvested), we have to use DataCite's OAI-PMH feed (https://oai.datacite.org/oai - the harvesting set is GESIS.SRDA).
I don't get any sets, see the screenshot below.
The first time I tried, it only returned like 50 sets and I received an error message that said the response took too long.
Anyhow, I've seen the "10.6084/m9.figshare.12725075.v1" handle from the Figshare repo in some other repos too. In case you don't know whether it's an hdl or a DOI, the hdl proxy should always work.
I did a few tests and in all my cases a DOI proxy also works.
So in any case we could resolve these handles with a hdl URL. I emailed doi.org to find out if their proxy also always works since we prefer DOI. How do you feel about also allowing handles without hdl or doi in them and constructing them as a hdl or doi?
How do you feel about also allowing handles without hdl or doi in them and constructing them as a hdl or doi?
I had no idea that https://hdl.handle.net/10.6084/m9.figshare.12725075.v1 and https://doi.org/10.6084/m9.figshare.12725075.v1 would point to the same resource. I don't understand it, but that's pretty cool! Could you write more about what you mean when you write that "we could resolve these handles with a hdl URL"?
If Dataverse harvests a record with a dc:identifier that doesn't start with http/s://doi, doi:, https://hdl.handle.net, or hdl:, like 10.6084/m9.figshare.12725075.v1, you're saying that Dataverse could still know that the identifier is a DOI or HDL?
Could you write more about what you mean when you write that "we could resolve these handles with a hdl URL"?
By this I just mean constructing "https://hdl.handle.net/10.6084/m9.figshare.12725075.v1" from "10.6084/m9.figshare.12725075.v1". You can either prepend https://hdl.handle.net/ or https://doi.org/ to the handle.
If Dataverse harvests a record with a dc:identifier that doesn't start with http/s://doi, doi:, https://hdl.handle.net, or hdl:, like 10.6084/m9.figshare.12725075.v1, you're saying that Dataverse could still know that the identifier is a DOI or HDL?
The point is that we don't know whether it's a DOI or HDL but any handle should work with https://hdl.handle.net/ prepended to it. So I could just make a PR that:
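Under those assumptions, the proposed behaviour might look roughly like this. The helper names are hypothetical and the bare-handle fallback is only a guess at what the PR does; this is a sketch of the idea, not the actual change:

```java
import java.util.List;

public class PidScan {

    // Sketch of the proposed fix: scan ALL <dc:identifier> values instead of
    // only the first, and fall back to treating a bare "10.xxxx/yyyy" string
    // as a handle/DOI that the hdl proxy can resolve. Hypothetical helper.
    public static String findPersistentId(List<String> identifiers) {
        // First pass: look for any identifier in a recognized format.
        for (String id : identifiers) {
            String s = id.trim();
            if (s.startsWith("doi:") || s.startsWith("hdl:")
                    || s.startsWith("https://doi.org/")
                    || s.startsWith("https://hdl.handle.net/")) {
                return s;
            }
        }
        // Second pass: guess that a bare "10.xxxx/yyyy" string is a DOI or
        // handle and prepend the hdl proxy, which should resolve either.
        for (String id : identifiers) {
            String s = id.trim();
            if (s.startsWith("10.") && s.contains("/")) {
                return "https://hdl.handle.net/" + s;
            }
        }
        return null; // no usable identifier; the harvest of this record fails
    }
}
```

With this, a Figshare-style record whose only usable identifier is "10.6084/m9.figshare.12725075.v1" would yield https://hdl.handle.net/10.6084/m9.figshare.12725075.v1, while a record with only https://zenodo.org/record/16445 would still fail.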
@jggautier I tested the fix locally and managed to successfully harvest the Figshare repo with the format oai_dc and the set portal_895.
Awesome!
In the PR I see if (otherId.startsWith("10.") && otherId.contains("/")), so that's how Dataverse will guess that a dc:identifier is a hdl or doi when it doesn't start with http/s://doi, doi:, https://hdl.handle.net, or hdl:.
Looks like the PR covers all of the non-Dataverse repositories we listed earlier, including Zenodo. Would you say it's ready for code review?
@jggautier Totally ready!
Hi, interesting discussion.
Could a string that starts with 10. be neither a DOI nor a plain handle, but some other identifier? @janvanmansum Are you aware of any identifiers that start with 10., have a / in them, and are neither a DOI nor a handle? I've checked multiple repos (EASY, Zenodo, Figshare, Dataverse) and didn't find any.
@JingMa87 that might be a good question for https://www.pidforum.org 😄
@pdurbin There it is: https://www.pidforum.org/t/how-do-i-determine-whether-a-string-is-a-valid-handle/1122
@pdurbin I found a way to programmatically resolve a handle and will update the PR.
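One possible mechanism (an assumption on my part; the PR may do it differently) is the Handle.net proxy's public REST API, which answers GET requests at /api/handles/<handle> with JSON describing the handle record. A minimal sketch:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class HandleResolver {

    // Builds the Handle.net proxy REST API URL for a bare handle string.
    // Using this endpoint to verify resolvability is an assumption, not
    // necessarily what the actual PR does.
    public static String buildHandleApiUrl(String handle) {
        return "https://hdl.handle.net/api/handles/" + handle;
    }

    // Returns true if the proxy reports the handle as resolvable (HTTP 200).
    // Requires network access.
    public static boolean resolves(String handle) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
                new URL(buildHandleApiUrl(handle)).openConnection();
        conn.setRequestMethod("GET");
        return conn.getResponseCode() == 200;
    }

    public static void main(String[] args) {
        System.out.println(buildHandleApiUrl("10.6084/m9.figshare.12725075.v1"));
    }
}
```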
When I try to add Zenodo (https://zenodo.org/oai2d) as a harvesting client on Dataverse 4.8.6, 4.9.1, and 4.9.2, I get the error:
"Failed to find a global identifier in the OAI_DC XML record"