The first step was to figure out what has already been done by the Dataverse team and by the community towards this aim, and what still remains to be done. Leonid's summary at the conclusion of Issue 8574 does that.
This Epic tracks the work as proposed by Leonid:
I do believe that the third item under the "definition of done" - "prioritize" - was the truly important part of this spike. I also believe that most of the effort of prioritizing what's important can only be done within the dev team; I can't think of how anyone outside of it could be more qualified to make these calls. So I'm going to make such an attempt.
(Note that I'm interpreting the word "prioritizing" as assigning some order of importance to these issues and bugs, what makes sense to fix first and/or what's ready to be worked on vs. what needs more discussion; not as scheduling them for specific sprints, etc.!)
The single most important harvesting issue (OK, maybe not the most important, but seriously, this should be the first step of any meaningful cleanup of our harvesting implementation; it should be fairly easy to wrap up too):
The following issues are important in that fixing them will make harvesting more reliable and robust overall (for example, in the current implementation, a single missing metadata export that's supposed to be cached will break the entire harvesting run). All of the issues on the list below are defined clearly enough that they are ready to be worked on and fixed without needing any extra research first. Some of them may be VERY OLD, but they look like something we should fix.
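To make that failure mode concrete, here is a minimal Java sketch of the fix direction - isolating per-record failures so that one missing cached export is logged and skipped instead of aborting the whole run. This is illustrative only; the class and method names are hypothetical, not Dataverse's actual harvesting code.

```java
import java.util.ArrayList;
import java.util.List;

public class HarvestRunSketch {

    /** Outcome of a run: which records were served and which were skipped. */
    record HarvestResult(List<String> succeeded, List<String> failed) {}

    /** Stand-in for whatever produces (or reads back) a cached metadata export. */
    interface Exporter {
        String exportRecord(String identifier) throws Exception;
    }

    static HarvestResult harvest(List<String> identifiers, Exporter exporter) {
        List<String> ok = new ArrayList<>();
        List<String> failed = new ArrayList<>();
        for (String id : identifiers) {
            try {
                exporter.exportRecord(id); // may throw if the cached export is missing
                ok.add(id);
            } catch (Exception e) {
                failed.add(id); // record the failure and keep going, instead of failing the whole run
            }
        }
        return new HarvestResult(ok, failed);
    }

    public static void main(String[] args) {
        Exporter flaky = id -> {
            if (id.startsWith("bad")) throw new IllegalStateException("cached export missing: " + id);
            return "<record id='" + id + "'/>";
        };
        HarvestResult result = harvest(List.of("doi-1", "bad-2", "doi-3"), flaky);
        System.out.println("served=" + result.succeeded() + " skipped=" + result.failed());
    }
}
```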
The following 3 issues are basically the same thing: people requesting extra ISO language codes to be added as legitimate controlled vocabulary values (this is just a matter of adding extra values to citation.tsv). These are NOT duplicates - different things are being requested in each issue - but it makes sense to get all 3 out of the way at the same time:
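For context, adding a controlled vocabulary value is a one-line change in the #controlledVocabulary section of citation.tsv (see the Metadata Customization guide for the full column layout). The row below is a made-up illustration, not one of the specific codes requested in those issues:

```tsv
#controlledVocabulary	DatasetField	Value	identifier	displayOrder
	language	Interlingua		500
```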
The following issues are about the DDI exporter producing XML that is not valid under the schema.
Similarly, the following issues are requests for changes in how we export DC; I believe these need to be reviewed/discussed, perhaps together?
The following issues are proposed changes to the design of the harvesting framework and/or metadata exports, meaning they are something we probably need to discuss as a team before we decide that these are good ideas and proceed to implement them. But IMO they are (I opened all of them 😄):
There is of course this issue that was opened for figuring out what needs to be added specifically for the NIH/GREI grant:
- [ ] https://github.com/IQSS/dataverse/issues/8575
It is obviously super important and should be prioritized too; but it's more of a design discussion and research effort, rather than something we can already start coding.
The list above is by no means complete. If an issue is not listed, it does not necessarily mean that it's not important. But the ones that are listed above should be a good subset to start with.
Not finished defining. It seems like some of the work may apply to other NIH objectives that Len has mentioned.
Next steps:
I left off at the start of this paragraph: "The following issues are about the DDI exporter producing XML ..." All of the issues prior to that now have at a minimum the labels: Feature: Harvesting, NIH OTA DC, pm.epic.nih_harvesting
Figure out where these fit. Should they be part of another epic for the NIH grant?
More Background:
This is in support of:
an NIH grant, "The Harvard Dataverse repository: A generalist repository integrated with a Data Commons", Aim 4: Improve harvesting and packaging standards to share metadata and data across repositories.
There is a lot packaged into Aim 4:

- Improved Harvesting via the OAI-PMH standard
- Improved support for BagIt
- Improved support for Signposting

The scope for this issue is Harvesting via the OAI-PMH standard.
Aim 4:
Improve harvesting and packaging standards to share metadata and data across repositories
Our proposed project will significantly improve the widely-used Harvard Dataverse repository to better support NIH-funded research.
A critical measure of the GREI program’s success is to standardize the discoverability across generalist repositories.
To help with this, we propose to improve the existing harvesting functionality in the Dataverse software based on the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) standard, and coordinate with other repository packaging standards to share or move metadata and data.
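As a reminder of how lightweight the protocol is, an OAI-PMH harvest is just a sequence of HTTP GETs with a verb parameter. Below is a minimal Java sketch against a hypothetical endpoint (Dataverse installations typically expose OAI-PMH under /oai):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OaiListRecordsSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical server; swap in any real OAI-PMH base URL to try it.
        String base = "https://demo.dataverse.org/oai";
        URI uri = URI.create(base + "?verb=ListRecords&metadataPrefix=oai_dc");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(HttpRequest.newBuilder(uri).GET().build(),
                      HttpResponse.BodyHandlers.ofString());
        // The body is an OAI-PMH XML envelope containing <record> elements
        // (and possibly a <resumptionToken> for paging through large sets).
        System.out.println(response.body());
    }
}
```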
Dataverse already supports Bags as defined by the Research Data Alliance (RDA) Research Data Repository Interoperability Working Group. Here we propose to improve the support for Bags, test it for NIH-funded datasets, and explore and define the appropriate standard to use to move metadata and data across generalist repositories. This will help with sustainability and succession planning: if one repository can no longer support a specific dataset, the dataset can easily be moved to another repository without losing any information about it.
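For readers unfamiliar with the format: a bag is just a directory with a fixed layout and checksum manifests (BagIt is specified in RFC 8493). The sketch below shows a generic minimal bag, not the exact layout Dataverse produces:

```
example-bag/
├── bagit.txt             # "BagIt-Version: 1.0" and "Tag-File-Character-Encoding: UTF-8"
├── bag-info.txt          # optional metadata tags, e.g. Source-Organization, External-Identifier
├── manifest-sha256.txt   # one "<checksum>  <path>" line per payload file
└── data/                 # the payload: the dataset's actual files
    └── dataset-file.csv
```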
Additionally, we propose to implement Signposting in the Dataverse software. By adding HTTP Link headers throughout the application, we can more easily support automated metadata and data discovery in the repository, and allow other applications and services to more accurately and completely represent the content in the Harvard Dataverse repository.
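For illustration, Signposting conveys typed links in HTTP response headers. A dataset landing page might carry headers along these lines (the URLs below are hypothetical and the rel values follow the Signposting conventions; this is a hedged sketch, not Dataverse's actual output):

```
HTTP/1.1 200 OK
Content-Type: text/html
Link: <https://doi.org/10.12345/EXAMPLE> ; rel="cite-as" ,
      <https://repo.example.edu/api/datasets/export?exporter=schema.org&persistentId=doi:10.12345/EXAMPLE> ; rel="describedby" ; type="application/ld+json" ,
      <https://repo.example.edu/api/access/datafile/42> ; rel="item" ; type="text/csv"
```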
Related documents
- Notes on Dataverse Deliverables for NIH OTA
- NIH OTA Progress Notes
- NIH OTA
- Exposing and harvesting metadata using the OAI metadata harvesting protocol: A tutorial (2001)
- Getting Started with BagIt in 2018
- NIH OTA bagit from Library of Congress video