IQSS / dataverse

Open source research data repository software
http://dataverse.org

Feature Request/Idea: Make OAI-PMH harvesting more configurable #10677

Open philippconzett opened 1 month ago

philippconzett commented 1 month ago

Overview of the Feature Request The idea is to make OAI-PMH metadata harvesting more configurable, so that 1) the metadata about the datasets included in a given harvesting set can come from any selection of fields from any metadata schema defined in a Dataverse installation, and 2) the metadata can be based on standards other than Dublin Core (DC). See discussion in Dataverse Users Community Google Group.

What kind of user is the feature intended for? API User, Superuser, Sysadmin

What inspired the request? DataverseNO would like to implement interoperability support so that data can be made searchable and reusable through the Svalbard Integrated Arctic Earth Observing System (SIOS), an international observing system for long-term measurements in and around the Norwegian archipelago of Svalbard that addresses Earth System Science questions. A growing community in Europe and beyond makes, or is interested in making, its data reusable through SIOS. Currently, SIOS only supports harvesting of discovery metadata using OAI-PMH.

What existing behavior do you want changed? Currently, Dataverse supports OAI-PMH harvesting using a DC representation of (some of?) the metadata in the Citation Metadata block.

Any brand new behavior you want to add to Dataverse? Yes, the requested feature would extend the ways in which OAI-PMH metadata harvesting can be configured.

Any open or closed issues related to this feature request? Some of the issues below might be related:

IQSS/dataverse:

IQSS/dataverse-pm:

philippconzett commented 1 month ago

@pdurbin and I were chatting about this issue on Zulip and decided it would be good to share the conversation in a public channel, so I've pasted it below:

Philip Durbin:

Is it important to be able to configure which fields can be harvested, or is simply "all fields" (from all metadata blocks) sufficient?

Philipp Conzett:

I'm not sure. I think SIOS would need us to provide an OAI-PMH harvesting set that exposes metadata about all relevant datasets in DataverseNO according to the XML Schema of the GCMD DIF standard. But maybe it's possible to do some sort of selection and/or mapping of relevant fields based on what Dataverse exposes through OAI-PMH?

Philipp Conzett:

I've uploaded an example XML file about one dataset in SIOS (the file extension of the original file is .xml, but GitHub wouldn't allow me to upload .xml files): NPI_4e28fed2-cf18-52e8-8370-744ca8a4c7cf_dif10.txt

Philip Durbin:

Does SIOS use OAI-PMH to harvest this GCMD DIF format from any other data repositories? Or would Dataverse be the first?

Philipp Conzett:

SIOS supports OAI-PMH harvesting based on two metadata standards, GCMD DIF and ISO 19115. On the SIOS Data Portal page, I see in the right filter section that there are about 20 data centers being harvested. I don't know how many of these use GCMD DIF, but let's say half of them do.

Philip Durbin:

Interesting. I don't know what ISO 19115 is but is that also an option? I found https://en.wikipedia.org/wiki/Geospatial_metadata#ISO_19115:_Geographic_information_%E2%80%93_Metadata I bet Amber and others who are into geospatial data would like this.

Philipp Conzett:

Based on what I've found out about the two standards, GCMD DIF seems easier to implement.

Philip Durbin:

"Easier to implement" sounds good.

johannes-darms commented 1 month ago

@philippconzett we are also interested in this feature.

Could we implement something similar to the Exporter SPI, i.e. add custom modules (Importers) responsible for transforming a harvested metadata format into the corresponding metadata blocks?

cc:@vera @julian-schneider

qqmyers commented 1 month ago

Per https://github.com/IQSS/dataverse/blob/54767b97c25f8f9e3fd14e6844177397622eccc5/src/main/java/edu/harvard/iq/dataverse/harvest/server/web/servlet/OAIServlet.java#L161 - if an exporter is set as isHarvestable()=true and it is an XML format, I think it is made available as an option for harvesting. I'm not sure if XML is a requirement based on the spec or just a Dataverse choice.

We don't yet have the equivalent of the exporter spi to make importers, but if the idea here is just to let non-Dataverse catalogs harvest DV content, and it's XML, I think you just have to create/install the exporter you want.
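To illustrate how an external harvester might check which formats an installation offers, here is a minimal Python sketch using only the standard OAI-PMH verbs ListMetadataFormats and ListRecords. The base URL, the "dif" metadata prefix, and the sample response are hypothetical placeholders, not output from a real Dataverse installation; the prefix a custom exporter would actually register depends on how that exporter is written.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_oai_url(base_url, verb, **params):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"


def list_metadata_prefixes(response_xml):
    """Return the metadataPrefix values found in a ListMetadataFormats response."""
    root = ET.fromstring(response_xml)
    path = ".//oai:metadataFormat/oai:metadataPrefix"
    return [el.text for el in root.findall(path, OAI_NS)]


# Hypothetical response: "oai_dc" is the prefix every OAI-PMH server must offer;
# "dif" and its schema/namespace URLs are made-up stand-ins for a custom exporter.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>dif</metadataPrefix>
      <schema>https://example.org/dif.xsd</schema>
      <metadataNamespace>https://example.org/dif/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>
"""

# Hypothetical installation URL.
BASE_URL = "https://dataverse.example.org/oai"

prefixes = list_metadata_prefixes(SAMPLE_RESPONSE)
if "dif" in prefixes:
    # Only request records in a format the server actually advertises.
    print(build_oai_url(BASE_URL, "ListRecords", metadataPrefix="dif"))
```

A harvester would first fetch the URL built with the ListMetadataFormats verb, pass the response body to list_metadata_prefixes, and only then issue ListRecords with a prefix that appears in the result.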

philippconzett commented 1 month ago

Thanks for the feedback! I'm not sure if I understand the technical details. Could we schedule a call with Jim and/or Phil and those interested?

johannes-darms commented 1 month ago

@qqmyers: That's great, I wasn't aware of this feature! I thought we were talking about the other way round, collecting more metadata from other repositories...

@philippconzett That would be nice, we or at least one of us (@vera, @julian-schneider, @johannes-darms ) would like to join.

philippconzett commented 1 month ago

Great! I've created a when2meet calendar to help us schedule a call. I'll be on and off in vacation mode from today, but maybe Thursday or Friday next week could work for most of us?

It would be good if someone who knows the details of metadata export could join. I see that @poikilotherm, @qqmyers, and @pdurbin have contributed to the GDCC dataverse-exporters GitHub repo.

philippconzett commented 1 month ago

Just to make sure I'm on the right track: Is the functionality @qqmyers refers to above the one described in the section Metadata Export Formats in the Developer Guide?

poikilotherm commented 1 month ago

Yes indeed!

philippconzett commented 1 month ago

@qqmyers Thanks for filling in the when2meet calendar!

Pinging @vera, @julian-schneider, @johannes-darms, @DS-INRA, @gwendoux

I've created a collaborative notes doc. It currently contains a brief description of the DataverseNO-SIOS use case and how we could approach it to make the requested feature useful for other, similar use cases in the Dataverse community. Please feel free to contribute! Thanks!

poikilotherm commented 1 month ago

Leaving a note here that I shamelessly made use of my admin rights and added #2721 to the initial description.

Another note: I've been talking about creating an XML-RDF exporter for a long time now. That's the way to go when you want to expose all metadata in XML without much need for configuration.

Not sure if we'd prefer some standalone thing specialised in XML stuff or if we want to look into using sth like https://github.com/gdcc/exporter-transformer.

Also, not sure if these issues are related with regard to technical implementation: #10042, #9344, #10000

philippconzett commented 1 month ago

Thanks all for indicating your availability. I've sent you a calendar invite. Please let me know if you haven't got it.