IQSS / dataverse

Open source research data repository software
http://dataverse.org

ORCID integration in Dataverse #4236

Closed philippconzett closed 2 months ago

philippconzett commented 7 years ago

Some of the researchers depositing their data in our Dataverse installation use different author name variants, e.g. Eckhoff, Hanne vs. Eckhoff Hanne M. In some cases they prefer to use the same variants as in the article publication that is based on the dataset. This inconsistency in author names makes it somewhat difficult to use the author field for searching and filtering. The "same" author (with different name variants) appears several times in the Dataverse search, and, I guess, also in other search engines that harvest DataCite.

Dataverse already provides an ORCID field in the Citation Metadata section. But as far as I can see, this field is not available for searching/filtering in a user-friendly way through the GUI. I suggest that ORCID should be used in future versions of Dataverse to enable unique searching and filtering for author names in a user-friendly GUI.

See also this discussion on Dataverse Google Group.

In addition, it should be possible to make the ORCID field in the Citation Metadata section pre-filled using the ORCID field in account information.

Best, Philipp

Richardcwynne commented 6 years ago

One of the best ways to encourage ORCID adoption and the association of data with ORCID iDs is to enable ORCID federated sign-on. This is similar to Google/FB sign-on. A growing number of Open Science systems have adopted this approach so that researchers can use SSO across different systems and platforms. One example is shown here: https://www.ariessys.com/views-and-press/resources/video-library/orcid-single-sign-on/

pameyer commented 6 years ago

@Richardcwynne Are you thinking of something different than the ORCID authentication that Dataverse already supports?

Richardcwynne commented 6 years ago

Wow you have done it already! Fantastic!! Sorry I missed that. Richard.

pdurbin commented 6 years ago

@Richardcwynne no worries! Please see "Dataverse supports three OAuth providers: ORCID, GitHub, and Google" at http://guides.dataverse.org/en/4.9/installation/oauth2.html

mheppler commented 6 years ago

The authentication with ORCID was delivered in 4.6.1, but this issue requests additional features. As @philippconzett pointed out, there was a discussion in the Dataverse Users Community Google Group: "What else does ORCID Integration give you besides Login?"

@Richardcwynne any other suggestions for ORCID related features are welcome. If this specific GitHub issue does not cover a use case of yours, please feel free to open a new issue describing it.

RightInTwo commented 6 years ago

I just want to emphasize that this issue is about retrieving ORCID information for a third party to be used in the metadata, not OAuth authentication, as @philippconzett and @mheppler pointed out.

We also talked about ORCID as an example source when I visited this week's tech hour. It was about #4772, regarding the referencing of well-defined objects (people, vocabulary terms, or any other type of object through lists generated by external API calls) in the metadata.

It seemed impractical to query for the name and institution of each returned search result with the Fetch record details API, since that takes n additional requests instead of one. The corresponding discussion in the orcid-api-users Google group does not offer a way to do this with the search API alone. It might make sense to cache the results (as they keep narrowing down) until the ORCID API is improved so that the search call returns the ID, name and institution (to make sure the match is correct): https://pub.sandbox.orcid.org/v2.1/search/?q=John-Doe (use the header "Accept: application/json").
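As a rough illustration of the search call mentioned above, here is a minimal Python sketch. The URL and the Accept header come from the comment; the function names and the assumed response shape (a `result` list whose items carry `orcid-identifier.path`) are my own assumptions and should be checked against the ORCID API documentation. Since the search only yields IDs, name and institution would still need one extra lookup per ID, which is where caching would help.

```python
import urllib.parse
import urllib.request

# Sandbox search endpoint taken from the comment above.
SEARCH_URL = "https://pub.sandbox.orcid.org/v2.1/search/"


def build_search_request(query):
    """Build (but do not send) a search request that asks for JSON."""
    url = SEARCH_URL + "?" + urllib.parse.urlencode({"q": query})
    return urllib.request.Request(url, headers={"Accept": "application/json"})


def extract_orcid_ids(payload):
    """Pull the bare IDs out of a decoded search response.

    Assumed shape (not confirmed in this thread):
    {"result": [{"orcid-identifier": {"path": "0000-..."}}]}
    """
    ids = []
    for item in payload.get("result") or []:
        path = (item.get("orcid-identifier") or {}).get("path")
        if path:
            ids.append(path)
    return ids
```

Sending the request would be `urllib.request.urlopen(build_search_request("John-Doe"))`, followed by `json.load` on the response.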

I hope that the use of ORCID to find a person for a metadata field can be solved by a generalized way to reference data from API sources in the metadata, as described in the issue that is yet to be added.

jggautier commented 5 years ago

I'm wondering if this issue could be split into two or three, even though they're very related.

Using ORCID IDs to search for dataset authors

For the first feature @philippconzett wrote about, making it easier for users to search for dataset authors using dataset author ORCID IDs, I interpret this a few ways.

You can search for ORCID IDs using the basic search box, but you need to add quotes around the ID numbers.

Screen Shot 2019-07-08 at 11 39 15 AM

If you search without adding quotes, the results make me think that the search engine is treating the hyphens between each group of numbers like spaces and searching for four strings instead of one. For example, if you search for 0000-1111-2222-3333, the results will include datasets that have only 0000 in their metadata (and the search engine isn't ranking results that contain the entire string as most relevant). Could something be done, besides adding quotes, to make the search treat the whole ID as one string or treat results that have the whole string as most relevant?
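One possible workaround, sketched here in Python purely as an illustration, would be to wrap anything that looks like an ORCID iD in quotes before the query reaches the search engine. This is not how Dataverse handles queries; the function name and the regular expression are hypothetical.

```python
import re

# Four groups of four characters separated by hyphens; the last character
# may be a digit or X. The lookarounds skip IDs that are already quoted or
# that are part of a longer token.
ORCID_PATTERN = re.compile(
    r'(?<!["\w-])(\d{4}-\d{4}-\d{4}-\d{3}[\dXx])(?!["\w-])'
)


def quote_orcid_ids(query):
    """Wrap unquoted ORCID-looking strings in double quotes."""
    return ORCID_PATTERN.sub(r'"\1"', query)
```

With this, `quote_orcid_ids('0000-1111-2222-3333')` gives a quoted phrase, while a query that already contains the quoted ID is left unchanged.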

If we wanted to add the author identifier field to the advanced search, wouldn't we just need to edit the citation.tsv file so that for authorIdentifier, advancedSearch is TRUE?:

Screen Shot 2019-07-08 at 10 52 02 AM

(And then just follow the other steps in the Metadata Customization guide, e.g. loading tsv file, updating solr schema?)
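For what it's worth, the tsv edit itself could be scripted. The following Python sketch flips the flag for one field in the `#datasetField` section. It assumes the column is named `advancedSearchField` and that data rows line up with the header row; both are assumptions to verify against the real citation.tsv, and the sample text in the usage note is made up.

```python
def enable_advanced_search(tsv_text, field_name, column="advancedSearchField"):
    """Return tsv_text with `column` set to TRUE for the row named field_name.

    Only rows under the "#datasetField" header are touched; other sections
    (for example "#metadataBlock") are copied through unchanged.
    """
    out = []
    header = None  # column names of the section we are currently in
    for line in tsv_text.splitlines():
        cells = line.split("\t")
        if cells[0].startswith("#"):
            # A new section starts; remember its header only if it is the
            # dataset field section.
            is_fields = cells[0].strip() == "#datasetField"
            header = [c.strip() for c in cells] if is_fields else None
        elif header and column in header and "name" in header:
            col, name = header.index(column), header.index("name")
            if len(cells) > max(col, name) and cells[name].strip() == field_name:
                cells[col] = "TRUE"
        out.append("\t".join(cells))
    return "\n".join(out)
```

Usage would be along the lines of `enable_advanced_search(open("citation.tsv").read(), "authorIdentifier")`, with the result written back before reloading the block and updating the Solr schema.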

Pre-filling the citation block's author identifier field with ORCID IDs

> In addition, it should be possible to make the ORCID field in the Citation Metadata section pre-filled using the ORCID field in account information.

When I log into Dataverse using my ORCID account and then create a dataset, my ORCID ID in my Dataverse account is also pre-filled in the dataset's author identifier field.

ORCID-ID-prefilled

I think @RightInTwo's comment is more about Dataverse using metadata it already has, like author name, to recommend ORCID IDs by pulling those IDs from an external source (ORCID's database). So when I create a dataset and add an author name, Dataverse suggests ORCID IDs that might belong to that author and that I can then add to the author identifier field. Is that how this would work from the depositor's perspective?

RightInTwo commented 5 years ago

> I think @RightInTwo's comment is more about Dataverse using metadata it already has, like author name, to recommend ORCID IDs by pulling those IDs from an external source (ORCID's database). So when I create a dataset and add an author name, Dataverse suggests ORCID IDs that might belong to that author and that I can then add to the author identifier field. Is that how this would work from the depositor's perspective?

I think "metadata it already has" is a bit misleading, because at the point of adding authors, Dataverse does not necessarily have any metadata yet. But otherwise, yes!

pdurbin commented 5 years ago

> I'm wondering if this issue could be split into two or three, even though they're very related.

Yes. Absolutely. Probably even more. The smaller the better. Small items move across the board faster.

> You can search for ORCID IDs using the basic search box, but you need to add quotes around the ID numbers.

We could probably fix this by changing text_en to string as in the example below. The field is called "authorIdentifier". I thought about these quotes recently when replying at https://groups.google.com/d/msg/dataverse-community/5cdmqhv-Qdo/AJn0XxIoCAAJ . I agree that it would be nice not to require them.

$ git diff conf/solr/7.3.1/schema.xml
diff --git a/conf/solr/7.3.1/schema.xml b/conf/solr/7.3.1/schema.xml
index deabc789e..d5c3275e3 100644
--- a/conf/solr/7.3.1/schema.xml
+++ b/conf/solr/7.3.1/schema.xml
@@ -247,7 +247,7 @@
     <field name="astroType" type="text_en" multiValued="true" stored="true" indexed="true"/>
     <field name="author" type="text_en" multiValued="true" stored="true" indexed="true"/>
     <field name="authorAffiliation" type="text_en" multiValued="true" stored="true" indexed="true"/>
-    <field name="authorIdentifier" type="text_en" multiValued="true" stored="true" indexed="true"/>
+    <field name="authorIdentifier" type="string" multiValued="true" stored="true" indexed="true"/>
     <field name="authorIdentifierScheme" type="text_en" multiValued="true" stored="true" indexed="true"/>
     <field name="authorName" type="text_en" multiValued="true" stored="true" indexed="true"/>
     <field name="characteristicOfSources" type="text_en" multiValued="false" stored="true" indexed="true"/>

> If we wanted to add the author identifier field to the advanced search, wouldn't we just need to edit the citation.tsv file so that for authorIdentifier, advancedSearch is TRUE?:

Judging from https://github.com/IQSS/dataverse/blob/v4.15/src/main/java/edu/harvard/iq/dataverse/search/AdvancedSearchPage.java#L70 I think so but I haven't tried it.

> Pre-filling the citation block's author identifier field with ORCID IDs

As someone who logs in to Harvard Dataverse but who has an ORCID ID (without an "x" @philippconzett :smile: ) I'd love to be able to add it to my user profile. Currently, this only happens for people who log in with ORCID.

I don't believe the following issue has been mentioned yet but it's an integration I think we should consider: update users' ORCID record on dataset publication #3490

qqmyers commented 5 years ago

FWIW, on the SEAD project we created an input widget that allowed users to start typing a name, email, or ORCID digits, and we supported autocomplete for any we had seen before, storing the ORCID as the value, but displaying the name as a link to the ORCID page, and showing the email as a pop-up (iff the email in the ORCID profile was public). For new ORCIDs (ones we hadn't seen), you had to type them in but, once used, we queried ORCID to be able to display name, email, and ORCID. This made it reasonably useful without us having to provide search over all ORCIDs. It's possible that most of this could be packaged into a generic JavaScript library, but a service to query by name, email (if public), or ORCID to populate the autocomplete list is needed on the back end. If anyone wants a clearer explanation or is interested in trying to implement something like this in Dataverse, let me know.
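To make the back-end half of that idea concrete, here is a minimal Python sketch of a lookup over previously seen ORCIDs. It is not the SEAD implementation; the class and method names are hypothetical, and a real service would sit behind an HTTP endpoint and a persistent store.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Person:
    orcid: str                   # stored value, e.g. "0000-0002-1825-0097"
    name: str                    # shown to the user as the link text
    email: Optional[str] = None  # kept only if public in the ORCID profile


class SeenOrcidIndex:
    """In-memory index of ORCIDs that have been used before."""

    def __init__(self) -> None:
        self._people: Dict[str, Person] = {}

    def remember(self, person: Person) -> None:
        """Record a person once their ORCID has been used and looked up."""
        self._people[person.orcid] = person

    def suggest(self, typed: str, limit: int = 10) -> List[Person]:
        """Return people whose name, email, or ORCID digits contain the input."""
        needle = typed.strip().lower()
        if not needle:
            return []
        digits = needle.replace("-", "")  # let "0002-18" match across hyphens
        hits = []
        for person in self._people.values():
            in_name = needle in person.name.lower()
            in_email = person.email is not None and needle in person.email.lower()
            in_orcid = bool(digits) and digits in person.orcid.replace("-", "").lower()
            if in_name or in_email or in_orcid:
                hits.append(person)
        return sorted(hits, key=lambda p: p.name)[:limit]
```

A front-end widget would call `suggest` on each keystroke and store `Person.orcid` as the field value.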

pdurbin commented 5 years ago

@qqmyers I think your explanation is clear enough and it sounds like a great feature! Now we just need someone to code it up. 😄

@philippconzett I still agree that this issue should perhaps be broken into smaller, more clearly defined issues. We use the term "small chunks" for this. Small chunks move more quickly across our board: https://github.com/orgs/IQSS/projects/2 😄 Or maybe you could simply adjust the title of this issue to make it more specific?

philippconzett commented 4 years ago

@pdurbin Makes sense to me to break up this issue into smaller ones. I think I'll concentrate on the part that is about populating the ORCID field with the value stored in the Account Information. Before I create a new issue or rename this one, I'd like to ask some questions.

@jggautier explains above how the ORCID field in a dataset is automatically filled in based on the Account Information. That's nice! But in my Account Information, there is no ORCID field. This is probably because this information is fetched from our SSO provider? I cannot edit the Account Information in Dataverse either. When I look at the Account Information in my locally created test account on demo.dataverse.org, I cannot find the ORCID field either. So, how does one get the ORCID information into the Account Information in the first place? Is this only available when one uses sign-up / sign-in via ORCID?

poikilotherm commented 4 years ago

From what I know about the retrieval of the ORCID, this is only implemented when you log in via ORCID for now. For anything more advanced, one needs to enhance the ORCID API integration. See also #6329, #5974, #5689 and #5279.

pdurbin commented 4 years ago

> So, how does one get the ORCID information into the Account Information in the first place? Is this only available when one uses sign-up / sign-in via ORCID?

Yes, the ORCID ID is only stored for people who have authenticated with ORCID when logging into Dataverse. I believe it's stored in the persistentuserid column of the authenticateduserlookup table.

philippconzett commented 4 years ago

Thanks, @poikilotherm + @pdurbin. Would it be possible to get the ORCID into the Dataverse Account Information also when signing up via institutional SSO? If an institution provides ORCIDs for all its researchers, it would be nice to get this information into Dataverse along with other information such as affiliation.

pdurbin commented 4 years ago

@philippconzett I like to say that anything is possible with code. 😄 There are a few steps:

In short, the feature hinges on how strict you want to be about confirming that an individual "owns" the ORCID ID they say they do (like confirming an email address).

I hope this helps.

poikilotherm commented 4 years ago

For such integrations ORCID offers API endpoints for members. You can retrieve data from ORCID (or send yours) from (to) the user's profile there via their XML-based REST API.

This needs OAuth support on our side, but would ensure that we receive the ORCID from a trusted and validated source (vs any random human errors with such long IDs...).
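Independent of OAuth, typing errors in a manually entered ID can at least be caught locally, because the last character of an ORCID iD is a check digit computed with ISO 7064 MOD 11-2. The Python sketch below shows that check; the function names are my own, and a passing checksum only means the ID is well formed, not that it belongs to the person entering it.

```python
import re


def orcid_check_digit(base15):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the hyphenated format and the check digit of an ORCID iD."""
    if not re.fullmatch(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]", orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]
```

For example, `is_valid_orcid("0000-0002-1825-0097")` passes, while changing the final digit makes it fail.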

sheilarabun commented 4 years ago

Hi all, I was looking into the status of ORCID within Dataverse and came across this issue. Great to see the discussion and consideration. Indeed, the best practice is to gather authenticated ORCID iDs from individuals via OAuth, rather than allowing for manual entry which has room for error. The ORCID public API can be used to gather authenticated IDs and read public data from ORCID. The ORCID member API allows for the same but also for writing data to people's ORCID records. It all depends on the scopes that are used in the ORCID auth URL.

Note that currently institutions cannot provide ORCID iDs for their affiliated researchers; only individuals can register for their own ORCID iDs. So, to get ORCID iDs into the system you would really need the individuals to connect their ORCID iD via OAuth.

philippconzett commented 4 years ago

@sheilarabun: Could you please explain how researchers can "connect their ORCID iD via OAuth"?

sheilarabun commented 4 years ago

@philippconzett Yes, absolutely! Besides the explanation below, ORCID has more detailed documentation. Good starting points: https://members.orcid.org/api/oauth https://members.orcid.org/api/tutorial/get-orcid-id https://members.orcid.org/api/integrate/create-records

From the researcher/user perspective:

1) Regardless of how the user logged in to Dataverse, the idea is that there would be a button somewhere on their user dashboard or profile prompting them to "register or connect your ORCID iD". Behind this button is the ORCID auth URL containing the API client ID, scopes, and redirect URI. For example: https://orcid.org/oauth/authorize?client_id=APP-NPXKK6HFN6TJ4YYI&response_type=code&scope=/read-limited/activities/update&redirect_uri=https://bc.edu/orcid

2) Upon clicking the button, the user will be taken to the ORCID login page, where they either log in to their ORCID record, or register for an ORCID iD if they don't already have one.

3) Once logged in, the ORCID authorization screen appears, which is basically the client asking the user for permission to connect with their ORCID iD. There are a few different permissions (aka scopes) that can be asked for, and the scopes from the auth URL are reflected here. For example, if you just want to get the user's authenticated ORCID iD, you would use the /authenticate scope. If you wanted to write data to the user's ORCID record (using the Member API as opposed to the Public API), you would use the /activities/update scope and/or the /person/update scope. All of the scopes are defined here: https://members.orcid.org/api/oauth/orcid-scopes [Image: the authorization screen as it appears for the /read-limited and /activities/update scopes]

4) When the user clicks "authorize", the client will receive a handful of data via the API that will need to be stored in a secure database - for example:

The access token is what would allow subsequent API calls to either import data from or write data to the person's ORCID record. Here is a list of all of the possible data points that could be included on an ORCID record: https://www.lyrasis.org/Leadership/Pages/ORCID-Data-Fields.aspx
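The auth URL from step 1 can be assembled programmatically. This Python sketch builds it from the four query parameters shown in the example URL above; the function name and the placeholder client ID and redirect URI in the usage note are hypothetical, and joining multiple scopes with a space follows the general OAuth 2.0 convention rather than anything stated in this thread.

```python
from urllib.parse import urlencode

# Authorization endpoint as it appears in the example URL above.
AUTHORIZE_ENDPOINT = "https://orcid.org/oauth/authorize"


def build_authorize_url(client_id, redirect_uri, scopes=("/authenticate",)):
    """Build the URL that the "connect your ORCID iD" button would point to."""
    params = {
        "client_id": client_id,
        "response_type": "code",      # authorization code flow
        "scope": " ".join(scopes),    # space-delimited, per OAuth 2.0
        "redirect_uri": redirect_uri,
    }
    return AUTHORIZE_ENDPOINT + "?" + urlencode(params)
```

For instance, `build_authorize_url("APP-XXXXXXXXXXXXXXXX", "https://example.org/orcid")` requests only the /authenticate scope, which is enough to obtain the user's authenticated ORCID iD.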

You can try out the basic OAuth process in the ORCID sandbox here: https://members.orcid.org/api/oauth/presenting-oauth#try-it

One option that might make sense, is to have Dataverse be an ORCID service provider, where institutions that are using their own installation of Dataverse and are also organizational ORCID members could use their own ORCID member API credentials to enable this functionality. I'm happy to chat more if there are additional questions.

philippconzett commented 4 years ago

Thanks, @sheilarabun, for this in-depth explanation! In our Dataverse-run repository (DataverseNO), we basically have two types of users:

  1. Researchers at Norwegian research organizations using a national SSO service called Feide, which I think is based on / supports OAuth
  2. Other researchers

  A. Researchers in group 1 we want to sign up / log in through Feide.
  B. Researchers in group 2 we want to sign up / log in through ORCID. (Currently we create accounts for them manually.)
  C. For researchers of both groups we want their ORCID to be imported / accessible in Dataverse, e.g. as pre-filled value in the metadata field Author - Identifier.

DataverseNO has already implemented A. If I'm not mistaken, B is also possible in Dataverse. I guess for DataverseNO, we would have to combine B with some other process where researchers of group 2 must specify which collection / sub-dataverse they need to have deposit access to. As for C, I'm not sure if this would work the same way for researchers of both group 1 and 2.

cmbz commented 2 months ago

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.