landreev commented 1 month ago

DataCite maintains an OAI server (https://oai.datacite.org/oai) serving records for every DOI they have registered. There is a lot of interest in being able to harvest from them (since these are all registered DOIs, they will be redirecting to the original archival location of the actual studies/datasets etc.)

There are a couple of issues that must be addressed before our OAI client implementation can do that.

1. The oai_dc import code in Dataverse expects the metadata fragment to be self-contained and, most importantly, expects the main persistent identifier (the DOI in this case) to be present in the <dc:identifier> field. DataCite, however, does not include the main DOI in the oai_dc fragment. Since they use these DOIs as the OAI identifiers as well, they assume that it is enough to include them in the OAI record header, in the <identifier> field, like this:

<record>
<header>
  <identifier>doi:10.7910/dvn/tjclkp</identifier>
  <datestamp>2023-01-03T21:08:00Z</datestamp>
  <setSpec>HARVARDU</setSpec>
  <setSpec>GDCC.HARVARD-DV</setSpec>
</header>
<metadata>
  <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
     <dc:title>Open Source at Harvard</dc:title>
     <dc:creator>Durbin, Philip</dc:creator>
     <dc:publisher>Harvard Dataverse</dc:publisher>
     <dc:date>2017</dc:date>
     <dc:date>Issued: 2017</dc:date>
     <dc:description>The tabular file contains information ...</dc:description>
     <dc:contributor>Durbin, Philip</dc:contributor>
     <dc:type>Dataset</dc:type>
 </oai_dc:dc>
</metadata>
</record>

Without the <dc:identifier>, our code in its current form fails to import the record above. All we need to do is add some logic to use the identifier from the OAI header in situations like this. (We actually used to do that in one of the previous iterations of the harvester.)
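To make the fallback concrete, here is a minimal sketch in Python (not the actual Dataverse import code, which is Java). The function name `resolve_identifier` and the trimmed sample record are made up for illustration, and the sketch assumes the record sits in the standard OAI-PMH namespace, as it would inside a real response.

```python
import xml.etree.ElementTree as ET
from typing import Optional

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Trimmed-down version of the record above (assumption: OAI-PMH default namespace).
SAMPLE_RECORD = """\
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>doi:10.7910/dvn/tjclkp</identifier>
    <datestamp>2023-01-03T21:08:00Z</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
               xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
      <dc:title>Open Source at Harvard</dc:title>
    </oai_dc:dc>
  </metadata>
</record>
"""


def resolve_identifier(record_xml: str) -> Optional[str]:
    """Hypothetical helper: prefer <dc:identifier>, else use the OAI header identifier."""
    record = ET.fromstring(record_xml)
    # First choice: a non-empty <dc:identifier> inside the metadata fragment.
    for node in record.iterfind(".//dc:identifier", NS):
        text = (node.text or "").strip()
        if text:
            return text
    # Fallback: the <identifier> from the OAI record header.
    header_id = record.findtext("oai:header/oai:identifier", default="", namespaces=NS)
    return header_id.strip() or None


print(resolve_identifier(SAMPLE_RECORD))  # doi:10.7910/dvn/tjclkp
```

The order of preference (metadata first, header second) is a design choice for this sketch; it keeps the behavior unchanged for sources that already supply <dc:identifier>.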

2. DataCite's OAI implementation offers a very promising feature: it accepts arbitrary search queries as OAI set names (https://support.datacite.org/docs/datacite-oai-pmh#arbitrary-queries). This would make it possible to harvest individual records by their DOIs (something we've been asked for specifically) or any possible subset of their offerings. Example:
```
echo "doi%3A10.7910/DVN/TJCLKP" | base64 
ZG9pJTNBMTAuNzkxMC9EVk4vVEpDTEtQCg==
```
Now you can harvest this "set", made up of the one dataset above, as in https://oai.datacite.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=~ZG9pJTNBMTAuNzkxMC9EVk4vVEpDTEtQCg==

Unfortunately, for whatever reason, the above notation only works in ListRecords, but not in ListIdentifiers, which is what Dataverse actually uses. From talking to DataCite, they may be able to fix it eventually, but not in an instant, "oh yeah, we just had this one line commented out" way. We should go ahead and implement support for harvesting using ListRecords. It should be faster, if nothing else: for various historical reasons we currently go through ListIdentifiers and then GetRecord, one record at a time. It may also come in handy in other situations to have both modes supported (and configurable, per client maybe?).
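For reference, a rough Python sketch of how such a set name and ListRecords URL could be put together. The helper names are made up; the trailing newline in the input mirrors what `echo` appends in the shell example, and I have not checked whether DataCite cares about it either way. Percent-encoding the `=` padding as `%3D` is an assumption of this sketch and should be equivalent to the raw `==` in the URL above.

```python
import base64
from urllib.parse import urlencode

DATACITE_OAI = "https://oai.datacite.org/oai"


def query_to_set_spec(raw_query: bytes) -> str:
    """Hypothetical helper: '~' followed by the base64 of the query bytes."""
    return "~" + base64.b64encode(raw_query).decode("ascii")


def list_records_url(set_spec: str, metadata_prefix: str = "oai_dc") -> str:
    """Hypothetical helper: build a ListRecords request for the given set."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    }
    # Keep '~' literal; base64 padding '=' is percent-encoded as %3D.
    return DATACITE_OAI + "?" + urlencode(params, safe="~")


# Same bytes that `echo "doi%3A10.7910/DVN/TJCLKP"` pipes into base64,
# including the newline that echo appends.
set_spec = query_to_set_spec(b"doi%3A10.7910/DVN/TJCLKP\n")
print(set_spec)  # ~ZG9pJTNBMTAuNzkxMC9EVk4vVEpDTEtQCg==
print(list_records_url(set_spec))
```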

Clearly, we don't want to touch the current, JSF-based harvesting clients UI. But making the changes above, in the import and harvesting back end code, and then making it possible to set up or configure a client via the /api/harvest/clients API to take advantage of these improvements should be both useful and sufficient.

DS-INRAE commented 1 month ago

Item 1. is also true for other repositories, and would greatly enhance Dataverse's harvesting capacity :smiley:

scolapasta commented 1 month ago

#10936

landreev commented 4 weeks ago

There is an extra issue that @scolapasta pointed out in #10937, which I'm adding as task 3 here: in the current scheme of things the set name is stored in the database as a varchar(255). It should be changed to an unlimited text field, since it will be used for arbitrary DataCite search queries. For example, in our immediate use case this is likely going to be a very long list of individual DOIs.
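A quick back-of-the-envelope check of how fast such a set name outgrows 255 characters, sketched in Python. The DOIs are made-up placeholders, and the `doi:... OR doi:...` query shape is only an assumption about how a multi-DOI query might be written; the point is the length, since base64 inflates the query by roughly a third.

```python
import base64

VARCHAR_LIMIT = 255  # current column size for the set name


def encoded_set_name(dois):
    """Hypothetical helper: '~' + base64 of an OR-joined DOI query (assumed syntax)."""
    query = " OR ".join(f"doi:{doi}" for doi in dois)
    return "~" + base64.b64encode(query.encode("utf-8")).decode("ascii")


# Placeholder DOIs, for illustration only.
dois = [f"10.7910/DVN/EXAMPLE{i}" for i in range(20)]
set_name = encoded_set_name(dois)

# With only 20 DOIs the encoded set name is already well over the limit.
print(len(set_name) > VARCHAR_LIMIT)  # True
```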

fgassert commented 2 weeks ago

Hi Folks, Glad to see this moving forward 🙇 ! Just a comment that might inform the implementation of this. If you end up changing the harvesting client behavior to hit only ListRecords, this could potentially also allow for the harvesting of any static xml document mirroring the ListRecords response. This opens the door to other potential workarounds for harvesting other metadata.

Here's an example: https://groups.google.com/g/dataverse-community/c/XrQsCTVZzAE/m/vVIFL6xeDwAJ
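To illustrate what I mean, a minimal Python sketch of reading records out of a static document shaped like a ListRecords response. The sample document and the function name are made up, and a real harvester would also have to deal with resumption tokens and error responses.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up static file content mirroring a ListRecords response.
STATIC_DOC = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>doi:10.7910/dvn/tjclkp</identifier>
        <datestamp>2023-01-03T21:08:00Z</datestamp>
      </header>
      <metadata/>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def iter_header_identifiers(list_records_xml: str):
    """Hypothetical helper: yield the header identifier of each record."""
    root = ET.fromstring(list_records_xml)
    for record in root.iterfind("oai:ListRecords/oai:record", OAI_NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=OAI_NS
        ).strip()
        if identifier:
            yield identifier


print(list(iter_header_identifiers(STATIC_DOC)))  # ['doi:10.7910/dvn/tjclkp']
```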

Add support for OAI-harvesting from DataCite #10909

Item 2 has been split off as https://github.com/IQSS/dataverse/issues/10936