IQSS / dataverse

Open source research data repository software
http://dataverse.org

As an installation admin, I want my repository to export OpenAIRE-compliant metadata to improve discoverability, reusability of research data #4257

Closed jggautier closed 5 years ago

jggautier commented 6 years ago

@philippconzett (Dataverse Network Norway) wrote in https://groups.google.com/forum/#!msg/dataverse-community/lgSTeI-0zkQ/R7W8CfzvAAAJ:

The EU-sponsored research infrastructure project openAIRE aims to promote open scholarship and substantially improve the discoverability and reusability of research publications and data. Their guidelines have by now gained status as de-facto standards for OA research publication and data providers. In their Guidelines for Data Archives, they state i.a. what kind of metadata information research data archives should provide. ... For us, and I guess for other Dataverse installations/users in Europe, compliance with the openAIRE guidelines is important. So, I wonder whether information about access and license(s) could be complemented in a new version?

@juancorr shared in another issue about adding DataCite metadata to the Export Metadata pulldown that Dataverse e-cienciaDatos

has expanded its DataCite metadata to be compliant with the European OpenAIRE guidelines (https://guidelines.openaire.eu/en/latest/data/index.html)...

If you want to develop this feature, we can collaborate.

The definition of done for this issue will be a Dataverse admin being able to have OpenAIRE harvest OpenAIRE-compliant metadata from her installation.

juancorr commented 6 years ago

We have used some ugly tricks to achieve OpenAIRE compatibility because Dataverse does not have all the metadata that OpenAIRE needs. You can see them in the file https://github.com/Consorcio-Madrono/dataverse/blob/v4.6WithOpenAIRE/src/main/resources/templates/datacite_40.ftl .

pdurbin commented 6 years ago

datacite_40.ftl

This .ftl file must be a FreeMarker file. I see the dependency has been added to the pom.xml at https://github.com/Consorcio-Madrono/dataverse/blob/025df77e0a25a8ad9221fec61925af88ed09053a/pom.xml#L57 . Perhaps this would be better discussed at https://groups.google.com/forum/#!forum/dataverse-dev (please feel free to start a thread there if you like, @juancorr ) but I'm curious about why you've introduced FreeMarker into your branch and if there is any alternative that's already part of the Java EE standard. I'm not trying to criticize. I'm just curious. I've never used FreeMarker.

juancorr commented 6 years ago

We used the sbgrid code as a base (https://github.com/sbgrid/sbgrid-dataverse/tree/feature/datacite-xml). We only patched that code and the Dataverse code, and adapted the initial sbgrid FreeMarker file to produce valid DataCite XML that meets the OpenAIRE guidelines. It is the first time that I have used a FreeMarker file too, but it is easy to adapt to other institutions' requirements and to keep special cases out of the Java code. This works very well with e-cienciaDatos, but we have only 12 datasets. We have not tested it in a large Dataverse installation. Sorry, I do not have enough experience with these files to discuss them further.

pdurbin commented 6 years ago

@juancorr oh! So you weren't the one to add the FreeMarker dependency. It's from the SBGrid branch. Thanks. I understand now.

juancorr commented 6 years ago

Yes, I mentioned it in my first comment in https://github.com/IQSS/dataverse/issues/3697 , but I should have emphasized it.

abollini commented 6 years ago

Dear all, I’m glad to announce that our proposal to enhance the interoperability of several open source platforms has been awarded by OpenAIRE; see https://www.4science.it/en/2018/02/23/4science-awarded-by-openaire/ . In our proposal, we have included the implementation of the Data Repository Guidelines in Dataverse, more specifically support for the DataCite schema 4.1, to be ready for the new version of the guidelines that is expected soon. We have just found this thread. I’m really happy to see our assumptions about the benefit of this development confirmed by the community, and I will be happy to contribute to developing a general solution that works for all and hopefully can be included by default in a future Dataverse version.

pdurbin commented 6 years ago

@abollini that's great news! Can you please also start a new thread about this at https://groups.google.com/forum/#!forum/dataverse-community to spread the word? Thanks!

pdurbin commented 6 years ago

@abollini thanks for posting https://groups.google.com/d/msg/dataverse-community/OALTzINxkX0/v_WwJ4cvAwAJ ! Also, I mentioned your proposal in the Dataverse Community News yesterday: https://groups.google.com/d/msg/dataverse-community/AlZHT6tQM3U/0RrMUOv1AgAJ

Next it would be great to get a shared understanding of what you think the pull request will look like and what the scope of change will be. To get on the same page literally, it would be nice to have a Google doc or similar for what you have in mind. For now I'm linking to this issue in the "Dev Efforts by the Dataverse Community" spreadsheet at https://docs.google.com/spreadsheets/d/1pl9U0_CtWQ3oz6ZllvSHeyB0EG1M_vZEC_aZ7hREnhE/edit?usp=sharing but please feel free to create new issues as needed if you want to divide the work into smaller chunks. In our experience, smaller chunks move more easily across our kanban board at https://waffle.io/IQSS/dataverse

In short, please let us know if there is anything you need!

abollini commented 6 years ago

We have created a PR with the result of our development: https://github.com/IQSS/dataverse/pull/4664/ . We will be happy to receive feedback and improve it as needed.

pdurbin commented 6 years ago

@abollini hi! Thanks for the pull request! I just advanced it to Code Review at https://waffle.io/IQSS/dataverse and left you a review.

@juancorr are you interested in giving a review as well?

juancorr commented 6 years ago

Thanks Philip, yes I am very interested. I will review it.

Juan Corrales

djbrooke commented 6 years ago

Great! Thanks @abollini and team for the PR, @pdurbin for the feedback, and @juancorr for taking a look!! :) I'll move this to the Inbox column on our Waffle board for now, as it's a large PR and there are already some feedback and community review offers.

pdurbin commented 6 years ago

@abollini any news? Are you blocked? Do you need anything? @juancorr and I have been chatting a bit in IRC if you'd like to join us some day. 😄

pdurbin commented 6 years ago

In 4b28306 I added "DataCite OpenAIRE" to the list of export formats. @djbrooke and I just spoke about how tests would be nice but they're tricky for external developers to write so I went ahead and moved this issue (and #3697) to QA.

kcondon commented 6 years ago

I haven't begun testing yet, but during a test deployment I found that OpenAIRE was not appearing in the export list, and this error is in the server log: Could not find key "dataset.exportBtn.itemLabel.dataciteOpenAIRE" in bundle file.
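A check along the following lines could catch this kind of missing label before deployment. This is only a sketch: the key name comes from the log message above, but the sample bundle contents, the label value, and the helper names `parse_properties` and `missing_keys` are assumptions made for illustration, and the parser handles only plain `key=value` lines (no escapes, continuations, or `:` separators).

```python
# Key reported as missing in the server log above.
REQUIRED_KEY = "dataset.exportBtn.itemLabel.dataciteOpenAIRE"


def parse_properties(text):
    """Parse simple Java-style .properties text into a dict.

    Only plain `key=value` lines are supported; comment lines
    starting with '#' or '!' and blank lines are skipped.
    """
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            entries[key.strip()] = value.strip()
    return entries


def missing_keys(bundle_text, required):
    """Return the required keys that are absent from the bundle text."""
    entries = parse_properties(bundle_text)
    return [key for key in required if key not in entries]


# Hypothetical excerpt; the real bundle file has many more entries.
SAMPLE_BUNDLE = """
# export button labels (illustrative values)
dataset.exportBtn.itemLabel.ddi=DDI
dataset.exportBtn.itemLabel.json=JSON
"""

if __name__ == "__main__":
    for key in missing_keys(SAMPLE_BUNDLE, [REQUIRED_KEY]):
        print(f'Could not find key "{key}" in bundle file.')
```

Adding a line for the missing key to the bundle text makes `missing_keys` return an empty list.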

pdurbin commented 6 years ago

@kcondon good catch. Fixed in 7c11bc0. Here's how it looks:

[Screenshot: 2018-05-30, 4:27:43 PM]

pdurbin commented 6 years ago

@abollini @lap82 @francescopioscognamiglio @juancorr please take a look at the tests I added as of 5336e67. As of this writing OpenAireExportUtil.java, for example, has 51.79% code coverage, up from 0%. 😄 Here's how it looks in NetBeans:

[Screenshot: 2018-05-31, 9:05:15 AM]

juancorr commented 6 years ago

Thanks @pdurbin , I have just started my war with code coverage tools (OK, NetBeans is a good ally); I did not know about this one. I will look at the tests. What is the right way to suggest more tests?

juancorr commented 6 years ago

@pdurbin @abollini @lap82 @francescopioscognamiglio I have added some new tests and have found two small bugs in the OpenAIRE code related to geolocation and the alternative title. Should I open a pull request against @abollini's code for the bugs and another pull request against the main Dataverse develop branch for the tests?

pdurbin commented 6 years ago

@juancorr we try to work in small chunks so multiple pull requests sounds better. Thanks!

jggautier commented 6 years ago

Hi everyone,

Is there a crosswalk or any documentation I could peek at for this PR? It's really cool being able to poke at this work, but it might be helpful if there's a crosswalk or something explaining how fields are being mapped.

For now, here are other potential problems I've seen with the OpenAIRE metadata in the PR as of last week. I'm not sure how important it is to fix many of these problems for this GitHub issue, but I would argue that at least the first should be considered and fixed:

I hope this helps!

jggautier commented 6 years ago

Since the purpose of this pull request is mainly to get Dataverse to export OpenAIRE-compliant metadata so that OpenAIRE can harvest it, I'm adding OpenAIRE's validator page, https://www.openaire.eu/validator/welcome, which also includes a link to register your repository.

pdurbin commented 6 years ago

@jggautier I thought #4318 was about harvesting. This issue is about export.

pdurbin commented 6 years ago

@jggautier @kcondon and I just talked this out. @jggautier is going to work on figuring out what work remains before this issue about OpenAIRE goes to QA.

juancorr commented 6 years ago

Thanks @jggautier,

I will try to answer some points related to OpenAIRE compatibility. I hope I can explain it in English.

The development done with #4318 allows Dataverse to be compatible with the OpenAIRE 4.0 guidelines, which are still in DRAFT, but compatible dataverses or Dataverse installations must still fill in all the required OpenAIRE metadata. I think that this development is also compatible with the current guidelines, but I have not checked that yet.

pdurbin commented 6 years ago

@abollini @lap82 @francescopioscognamiglio please note that @juancorr has made a pull request against your pull request at https://github.com/4Science/dataverse/pull/4 to add some more tests.

jggautier commented 6 years ago

Thanks @juancorr. I'm hoping we can use @abollini's Google Groups thread to get a shared understanding of the scope of this issue, which will be helpful when it comes time to test this PR. Everyone who's interested, please feel free to add your thoughts. Thanks!

pdurbin commented 6 years ago

I just read through the post by @jggautier above and it's a great summary of the conversation he, @kcondon and I had yesterday. @abollini @lap82 @francescopioscognamiglio @juancorr please take a look and let's talk about the scope of the pull request and how much more development needs to be done before we advance it from code review to QA. Thanks! Others are welcome to comment as well, of course!

pdurbin commented 6 years ago

@abollini what do you think?

djbrooke commented 6 years ago

I need to get up to speed on this with @jggautier early this week, post Community Meeting. :)

@abollini @lap82 @francescopioscognamiglio @juancorr we should have some feedback soon.

jggautier commented 6 years ago

Thanks, @djbrooke, for discussing with me. We're looking forward to getting @abollini's input on:

djbrooke commented 6 years ago

Thanks @jggautier, moving back to Development until this feedback is implemented or responded to.

djbrooke commented 6 years ago

Hey @abollini - any news? Let us know if there's anything we can do. Thanks!

abollini commented 6 years ago

Hi all, sorry for the delay. We will try to reply to your comments by the end of next week at the latest.

pdurbin commented 6 years ago

@abollini hi! Any news?

pdurbin commented 6 years ago

Last week @jggautier indicated he's interested in trying something on a running server with the openaire branch on it. This morning I pinged @juancorr at http://irclog.iq.harvard.edu/dataverse/2018-07-16#i_70126 and he's going to set up a server for testing soon. Thanks!

While I'm writing, any news, @abollini ?

pdurbin commented 6 years ago

I've been out for a week. Any news on this issue? I see @jggautier left a longish comment at https://github.com/IQSS/dataverse/pull/4664#issuecomment-405722192 but that was three weeks ago.

pdurbin commented 5 years ago

@jggautier when you get a chance can you please summarize the status of this issue?

djbrooke commented 5 years ago

Moving to the inbox until there's additional work on this.

pdurbin commented 5 years ago

I just noticed that @fcadili resolved the merge conflicts in pull request #4664. Thanks!

Does that mean you are ready for code review? Please let us know how we can help. 😄

pdurbin commented 5 years ago

@jggautier I spun up the branch (openaire-103925a) at http://ec2-100-27-31-230.compute-1.amazonaws.com:8080 if you'd like to poke around. The password is "admin1".

jggautier commented 5 years ago

Thanks @pdurbin! I'm trying to see the exported OpenAIRE metadata for a dataset, but when I try to export it, or export any metadata really, I get a "This site can’t be reached" page. Is it possible to export the OpenAIRE metadata?

pdurbin commented 5 years ago

@jggautier whoops! My fault! I hadn't configured dataverse.siteUrl. http://ec2-100-27-31-230.compute-1.amazonaws.com:8080/api/datasets/export?exporter=oai_datacite&persistentId=doi%3A10.5072/FK2/G251YB should now work, which is a link from Export Metadata at http://ec2-100-27-31-230.compute-1.amazonaws.com:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/G251YB
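For anyone scripting this, the export link can be assembled from the site URL, the exporter name, and the dataset's persistent ID. The sketch below only builds the URL; the endpoint path and query parameter names are copied from the link above, while the helper name `export_url` is hypothetical. It assumes the server accepts the same percent-encoding shown in that link.

```python
from urllib.parse import urlencode


def export_url(site_url, exporter, persistent_id):
    """Build a metadata export URL shaped like the link in this comment.

    safe="/" leaves slashes in the DOI unescaped, matching the link
    above; the colon is still percent-encoded as %3A.
    """
    query = urlencode(
        {"exporter": exporter, "persistentId": persistent_id},
        safe="/",
    )
    return f"{site_url.rstrip('/')}/api/datasets/export?{query}"


if __name__ == "__main__":
    print(
        export_url(
            "http://ec2-100-27-31-230.compute-1.amazonaws.com:8080",
            "oai_datacite",
            "doi:10.5072/FK2/G251YB",
        )
    )
```

Fetching the resulting URL (for example with a browser or `curl`) should return the exported XML, provided the installation has `dataverse.siteUrl` configured as described above.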

fcadili commented 5 years ago

Yes, I'm working on it. I'm double-checking that I have applied the feedback received, and I will comment on the PR about it soon. Thanks for reviewing it.

pdurbin commented 5 years ago

@fcadili great! I just invited you to join https://github.com/orgs/IQSS/teams/dataverse-readonly/members . If you're ok being assigned to this issue, I'll move it from "Inbox" to "Community Dev" at https://waffle.io/IQSS/dataverse

djbrooke commented 5 years ago

Thanks @fcadili for the updated PR. I'm assigning myself and @jggautier so that we can check out what was implemented from a metadata perspective. We may have some questions, but after that we'll move it along so a developer can review it. Thanks again!

jggautier commented 5 years ago

Thanks @fcadili! The concerns I had about funder, language and rightsList metadata seem resolved. Looks great!

The rules being used for figuring out the creator "nametype" seem to have changed. They seem to be:

  1. if Identifier Scheme is set to ORCID and there's a value in Identifier, "nametype" is set to "Personal"
  2. if there's an affiliation, "nametype" is set to "Personal"
  3. otherwise, "nametype" isn't used

The first rule is great I think, since it seems that ORCID is intended for only researchers. But I think the second rule will result in a lot of creators being tagged as "personal" when they're not. I see a lot of datasets in Dataverse repositories (and in non-Dataverse repositories harvested by Harvard Dataverse, like ICPSR and ODESI) where the author is an organization, and the affiliation field contains another organization, like the organization's host institution.
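To make the concern concrete, here is a minimal sketch of the three rules as I read them. It is not the code from the PR: the function name, parameter names, and the sample creators are made up for illustration, and "Personal" is used as the value exactly as quoted in the rules above.

```python
def creator_name_type(identifier_scheme=None, identifier=None, affiliation=None):
    """Return "Personal" or None, following the three observed rules."""
    # Rule 1: an ORCID scheme with a non-empty identifier means a person.
    if identifier_scheme == "ORCID" and identifier:
        return "Personal"
    # Rule 2: any affiliation is also treated as evidence of a person.
    if affiliation:
        return "Personal"
    # Rule 3: otherwise the attribute is omitted entirely.
    return None


if __name__ == "__main__":
    # A researcher with an ORCID: rule 1 applies.
    print(creator_name_type("ORCID", "0000-0000-0000-0000"))
    # An organization hosted by another organization (hypothetical names):
    # rule 2 labels it "Personal", which is the miscategorization at issue.
    print(creator_name_type(affiliation="Example Host University"))
    # No identifier and no affiliation: rule 3, no attribute.
    print(creator_name_type())
```

The second call shows the problem: nothing in rule 2 distinguishes a person with an affiliation from an organization with a host institution.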

Sending metadata that indicates that an author is a person or an organization seems to be important (e.g. https://github.com/IQSS/dataverse/issues/5029, studies being done into authorship decisions, generating citations in different styles). I just don't know how tolerant of miscategorized creators we should be. DataCite uses an algorithm that we're told is right about 90% of the time.

djbrooke commented 5 years ago

Moving back to Community Dev for now. @fcadili let us know your thoughts on the above!

fcadili commented 5 years ago

I'm working on the creator nametype in order to apply the DataCite algorithm described in https://github.com/IQSS/dataverse/issues/2243#issuecomment-358615313. When done, I will comment on the PR about it. Thanks for reviewing it.

jggautier commented 5 years ago

Thanks @fcadili. I saw the latest comment in your PR (https://github.com/IQSS/dataverse/pull/4664#issuecomment-484387154) about using that algorithm. Moving this to code review.