What we are currently doing to prepare our repository:
Now that is the stuff I want to get into our institutional dataverse. This is only about metadata! The data would reside at its original source.
@donsizemore You mentioned Python code in the chat. What does it do exactly?
@djbrooke @pdurbin Hey Guys! Would it make sense to break this down in some way? Or is an issue consolidation in progress/to be expected for this as well?
@RightInTwo to help us keep this on our radar I think you should consider creating a project for your installation at https://github.com/orgs/IQSS/projects . If you're interested, please let me know and I can add you to a "read only" group. Beware that this also means we can assign issues to you. :smile: For more context on boards for installations, please see https://scholar.harvard.edu/pdurbin/blog/2019/jupyter-notebooks-and-crazy-ideas-for-dataverse
Also, breaking down issues is almost always good. It makes them easier to estimate. :+1:
> @RightInTwo to help us keep this on our radar I think you should consider creating a project for your installation at https://github.com/orgs/IQSS/projects . If you're interested, please let me know and I can add you to a "read only" group. Beware that this also means we can assign issues to you. :smile:
Very nice. Sign me up! Don't think that your threat will stop me :laughing:
> For more context on boards for installations, please see https://scholar.harvard.edu/pdurbin/blog/2019/jupyter-notebooks-and-crazy-ideas-for-dataverse
I feel honored to be mentioned :D
> Also, breaking down issues is almost always good. It makes them easier to estimate. :+1:
I'd be glad to. It would be great if some other people with general interest in harvesting features joined on this issue to make it easier to smash it into digestible pieces and prioritize them. Maybe there are also already good solutions to (some of) the problems in existence... @pdurbin, could you help me out with some more of your community magic?
How about re-framing this as "Harvest metadata from a list of DOIs"?
> How about re-framing this as "Harvest metadata from a list of DOIs"?
Maybe. Maybe we should try to tell a user story. How's this?
"As a user, I'd like to collect datasets in Dataverse based on metadata available in DataCite. These datasets would behave somewhat like harvested datasets in that they are read only and would clearly indicate that they did not originate in Dataverse."
I worry that I'm not understanding the "why" though. Are you saying that the researchers need a tool to collect related datasets together and that Dataverse could be that tool? What do they do now? Do they just have a bunch of bookmarks in their browser?
> I worry that I'm not understanding the "why" though. Are you saying that the researchers need a tool to collect related datasets together and that Dataverse could be that tool? What do they do now? Do they just have a bunch of bookmarks in their browser?
We don't publish any data ourselves. Therefore, it is necessary to collect references (DOIs) from the diverse places where data has been published.
For example, our unit DD is responsible for Components 1 and 6 of the German Longitudinal Election Study. The data resides at GESIS, but we would like to reference it in our institutional repository (based on Dataverse). So we would like to add the following DOIs from that page to our catalogue and map (link) them to the dataverse of the unit DD and to those of individual researchers (if they want to). Also, we want to use that information to feed the CRIS.
https://doi.org/10.4232/1.13089 https://doi.org/10.4232/1.12722 https://doi.org/10.4232/1.12808 https://doi.org/10.4232/1.12809 https://doi.org/10.4232/1.13168 https://doi.org/10.4232/1.13137 https://doi.org/10.4232/1.13138 https://doi.org/10.4232/1.13139 https://doi.org/10.4232/1.12804 https://doi.org/10.4232/1.12805 https://doi.org/10.4232/1.12806 https://doi.org/10.4232/1.12443 https://doi.org/10.4232/1.12043 https://doi.org/10.4232/1.11443 https://doi.org/10.4232/1.11444
Metadata should be retrieved from the best source available:
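Whatever the source list ends up looking like, DOI content negotiation makes the retrieval itself easy to script. A minimal sketch (any of the DOIs above works):

```python
# Minimal sketch: resolve a DOI straight to DataCite JSON via content negotiation.
import requests

r = requests.get(
    "https://doi.org/10.4232/1.13089",
    headers={"Accept": "application/vnd.datacite.datacite+json"},
)
metadata = r.json()
print(metadata["titles"][0]["title"])
```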
@RightInTwo thanks, this is helping. For now could you use the "Related Datasets" field to collect those DOIs? I just tried this at https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/24U2VG and here's a screenshot:
The "Related Datasets" field is multivalued, which is nice, and it supports HTML, so I was able to link to the DOIs, but there isn't much structure to it. It all just goes in a single text area. What do you think? What does @jggautier think? π
For now I would just use the "Other ID" field, but it would be best to have the DOI in the actual "Dataset Persistent ID" field.
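If scripting the "Other ID" route, something like this should work (an untested sketch; the endpoint and field names follow the native API docs, while the PID and agency are placeholders):

```python
# Rough sketch: fill the "Other ID" compound field of an existing dataset
# via the edit-metadata endpoint of the Dataverse native API.
import requests

fields = {"fields": [{
    "typeName": "otherId",
    "multiple": True,
    "typeClass": "compound",
    "value": [{
        "otherIdAgency": {"typeName": "otherIdAgency", "multiple": False,
                          "typeClass": "primitive", "value": "GESIS"},
        "otherIdValue": {"typeName": "otherIdValue", "multiple": False,
                         "typeClass": "primitive", "value": "doi:10.4232/1.13089"},
    }],
}]}

requests.put(
    "https://demo.dataverse.org/api/datasets/:persistentId/editMetadata",
    params={"persistentId": "doi:10.70122/FK2/EXAMPLE", "replace": "true"},
    headers={"X-Dataverse-key": "{insert API key here}"},
    json=fields,
)
```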
We are currently collecting them in a database outside of Dataverse, but at some point it would be great to get them in there together with the metadata. Until we manage that, we don't really want to make our Dataverse public (not even within the institute).
(Improving on #5998 would be appreciated anyways...)
One solution would also be to set up our own harvesting server, but that would limit the capabilities (metadata fields) to those supplied by OAI-PMH.
I have no idea if this factoid is helpful or not but Dataverse can harvest its own native JSON format over OAI-PMH. This means that every single metadata field is available, even custom metadata blocks. (That's my understanding anyway.) The downside, of course, is that you'd have to implement our crazy native JSON format in the harvesting server you create. :smile:
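For example, something like this (an untested sketch against the demo installation's OAI endpoint) should pull records in the native format:

```python
# Untested sketch: list records from a Dataverse OAI-PMH endpoint using the
# native JSON format ("dataverse_json" is the metadataPrefix Dataverse exposes).
import requests
import xml.etree.ElementTree as ET

OAI_URL = "https://demo.dataverse.org/oai"  # any installation's OAI endpoint

resp = requests.get(OAI_URL, params={
    "verb": "ListRecords",
    "metadataPrefix": "dataverse_json",
})
root = ET.fromstring(resp.content)
ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
for record in root.findall(".//oai:record", ns):
    print(record.find(".//oai:identifier", ns).text)
```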
Wouldn't it be easier to implement a separate service for this?
I'm also thinking in the direction of maybe slicing up Dataverse a bit and moving the complete harvesting into a separate module. It could run on its own, offer easier scaling, and use the Dataverse API to load new stuff into the database. (Not a microservice, but a modulith.)
It could either use Quarkus/Spring (stay'n in Java) or Python (excellent pyDataverse) :wink:
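Just to make the idea concrete, the core loop of such a module could be as small as this (a sketch only, assuming pyDataverse's Api client; get_doi_metadata and to_dataverse_json are hypothetical helpers that would live in the new module):

```python
# Sketch of the modulith's core loop: pull DOI metadata from an external
# source and push it into Dataverse via the native API (pyDataverse).
# get_doi_metadata() and to_dataverse_json() are hypothetical helpers.
from pyDataverse.api import Api

api = Api('https://dataverse.example.edu', api_token='{insert API key here}')

for doi in ['10.4232/1.13089', '10.4232/1.12722']:
    md = get_doi_metadata(doi)             # e.g. from DataCite
    ds_json = to_dataverse_json(md)        # map to Dataverse's native JSON
    api.create_dataset('myunit', ds_json)  # load into the target dataverse
```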
@poikilotherm Hi Oliver, that is kind of what I'm building now, except that I don't use any of the libraries but rather try to build it, without constraints or validation, in JS with jQuery. Not because that is so great, but because the colleagues who will take over for me (my contract ends in March) don't have any programming background except for some jQuery runtime manipulation in the browser...
Now, that all could be totally different if... a) ...Dataverse supported and maintained that functionality. That would of course be a lot of additional work and I understand that it may (currently) be out of scope. b) ...we developed something together! :D I'm not deep into Python, but pyDataverse seems very promising. And if there was a community effort on this, I'd be glad to be part of it and dump all that messy JS for good. @poikilotherm I could do the grunt work if you do the code structure and the QA :D
> One solution would also be to set up our own harvesting server, but that would limit the capabilities (metadata fields) to those supplied by OAI-PMH.
>
> Dataverse can harvest its own native JSON format over OAI-PMH
Good point! Maybe a small OAI-PMH server could be part of the solution then.
@donsizemore We once had a chat about this topic - are you still interested?
@RightInTwo since you're using Javascript you should definitely check out the new kid on the block when it comes to Dataverse API client libraries: dataverse-client-javascript! :tada:
Developed primarily by @tainguyenbui it may be new but it's moving fast! And it's on npm.
> @RightInTwo since you're using Javascript you should definitely check out the new kid on the block when it comes to Dataverse API client libraries: dataverse-client-javascript! :tada:
Yes, I just discovered that yesterday! Let's see what the other people on here think about the choices regarding language and architecture. Maybe @skasberger could also contribute with his opinion?
Here is some code as an example of how a quick and dirty import of DataCite metadata via DDI-XML works. After some experiments with mapping from one Python dict to another, trying to create a dict in the Dataverse JSON format, I ended up with a solution that really earns the "quick and dirty" tag: just insert everything into a string in the DDI-XML format accepted by /datasets/:importddi.
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import json

from jsonpath_ng.ext import parse as parseJsonPath
from requests import get, post

# pyDataverse doesn't provide an API to import DDI-XML yet, so we just use requests
# from pyDataverse.api import Api
# from pyDataverse.models import Dataverse

# %%
######################################################
# Setup

## Define API URLs
apimethods = {
    'datacite_get_datacitejson': {
        'usage': ("get the datacite+json representation of the DOI metadata. "
                  "You need to append a DOI (just the ID!) to the URL."),
        'url': 'https://data.datacite.org/application/vnd.datacite.datacite+json/'
    },
    'datacite_get_xbibliography': {
        'usage': ("get the x-bibliography representation of the DOI metadata. "
                  "You need to append a DOI (just the ID!) to the URL."),
        'url': 'https://data.datacite.org/text/x-bibliography/'
    }
}

## Provide API key
apikey = {
    'wzbdataverse': '{insert API key here}'
}

## Provide base URL
baseurl = {
    'wzbdataverse': 'https://dataverse.wzb.eu'
}

# %%
######################################################
# QUICK AND DIRTY function to map from datacite+json (as a Python dict)
# to DDI-XML (as a string)
def ddiXmlFromDoi(doi):
    md = json.loads(get(apimethods['datacite_get_datacitejson']['url'] + doi).content)
    citation = get(apimethods['datacite_get_xbibliography']['url'] + doi).content.decode('utf-8')
    issueDate = parseJsonPath('$.dates[?dateType="Issued"].date').find(md)[0].value
    pubYear = parseJsonPath('$.publicationYear').find(md)[0].value
    title = parseJsonPath('$.titles[0].title').find(md)[0].value
    creators = parseJsonPath('$.creators').find(md)[0].value
    keywords = parseJsonPath('$.subjects').find(md)[0].value
    descriptions = parseJsonPath('$.descriptions').find(md)[0].value
    version = 1
    subTitle = ''
    # The DDI string was cut off in the original comment; the skeleton below is a
    # reconstruction covering only the basics (title, authors, keywords, abstract).
    authors = ''.join(f'<AuthEnty>{c["name"]}</AuthEnty>' for c in creators)
    subjects = ''.join(f'<keyword>{k["subject"]}</keyword>' for k in keywords)
    abstracts = ''.join(f'<abstract>{d["description"]}</abstract>' for d in descriptions)
    ddixml = f"""<?xml version="1.0" encoding="UTF-8"?>
<codeBook xmlns="ddi:codebook:2_5" version="2.5">
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl>{title}</titl>
        <subTitl>{subTitle}</subTitl>
        <IDNo agency="DOI">{doi}</IDNo>
      </titlStmt>
      <rspStmt>{authors}</rspStmt>
      <verStmt><version date="{issueDate}">{version}</version></verStmt>
      <distStmt><distDate date="{issueDate}">{pubYear}</distDate></distStmt>
      <biblCit>{citation}</biblCit>
    </citation>
    <stdyInfo>
      <subject>{subjects}</subject>
      {abstracts}
    </stdyInfo>
  </stdyDscr>
</codeBook>"""
    return ddixml
```
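Pushing the result into a dataverse then looks roughly like this (a sketch; "mydataverse" is a placeholder alias for the target collection):

```python
# Sketch: import the generated DDI into a dataverse collection via
# /api/dataverses/$ALIAS/datasets/:importddi.
doi = '10.4232/1.13089'
resp = post(
    baseurl['wzbdataverse'] + '/api/dataverses/mydataverse/datasets/:importddi',
    params={'pid': 'doi:' + doi, 'release': 'no'},
    headers={'X-Dataverse-key': apikey['wzbdataverse']},
    data=ddiXmlFromDoi(doi).encode('utf-8'),
)
print(resp.status_code, resp.json())
```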
With a custom OAI-PMH server (which holds the metadata for a specified list of DOIs, also see #6425), the solution could be achieved with a harvesting client in Dataverse. Steps 1 & 2 would be run regularly (daily?).
[Diagram of the proposed setup (green: exists, red: todo)]
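To make step 1 more concrete, here is a very rough (untested) sketch of such a server, reusing ddiXmlFromDoi from the snippet above; a real implementation would also need Identify, ListMetadataFormats, datestamps, and resumption tokens:

```python
# Very rough sketch of the idea: a tiny OAI-PMH endpoint serving DDI records
# for a fixed list of DOIs, so a Dataverse harvesting client can pull them.
# Only ListIdentifiers and GetRecord are hinted at here.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

DOIS = ['10.4232/1.13089', '10.4232/1.12722']  # the DOIs to expose

class OaiHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        qs = parse_qs(urlparse(self.path).query)
        verb = qs.get('verb', [''])[0]
        if verb == 'ListIdentifiers':
            headers = ''.join(
                f'<header><identifier>doi:{d}</identifier></header>' for d in DOIS)
            body = f'<ListIdentifiers>{headers}</ListIdentifiers>'
        elif verb == 'GetRecord':
            doi = qs['identifier'][0].replace('doi:', '', 1)
            ddi = ddiXmlFromDoi(doi).split('?>', 1)[-1]  # drop the XML declaration
            body = f'<GetRecord><record><metadata>{ddi}</metadata></record></GetRecord>'
        else:
            body = '<error code="badVerb">not implemented in this sketch</error>'
        xml = ('<?xml version="1.0" encoding="UTF-8"?>'
               '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
               f'{body}</OAI-PMH>')
        self.send_response(200)
        self.send_header('Content-Type', 'text/xml')
        self.end_headers()
        self.wfile.write(xml.encode('utf-8'))

HTTPServer(('', 8080), OaiHandler).serve_forever()
```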
@tcoupin @pdurbin @djbrooke Since this is not going to be a core feature, where should this project reside and under what name? In the IQSS github, named something like "doi2pmh-server"?
@RightInTwo I could create an empty repo for you if you want. You'd want to mention prominently in the README that it's community supported. Nice diagram! (And nice code earlier. :smile:)
> @RightInTwo I could create an empty repo for you if you want.
@tcoupin Would you agree to administering this? Since my contract ends in the end of March, I cannot commit to that, but will gladly take part in developing it until then.
Yes :+1:
@RightInTwo @tcoupin ok I just created https://github.com/IQSS/doi2pmh-server and made you admins of it. Again, please make sure you indicate that this is a community supported project. Have fun you two. :smile:
Thanks @pdurbin for setting up that repo.
@tcoupin @RightInTwo Thanks for working on this. I think a solution that allows institutions to easily set up their collections in an OAI-PMH server and then have the metadata reflected in Dataverse for discoverability purposes is great.
@pdurbin @djbrooke Thanks for making this happen!
@poikilotherm @tcoupin See you on the other side!
See https://github.com/IQSS/doi2pmh-server for the continuation!
I would like to harvest heterogeneous sources that don't necessarily present the datasets I need through OAI-PMH or in the form I need them. The issues I see with OAI-PMH:
These datasets would be described and updated using the metadata for the DOIs supplied by DataCite and Crossref through the import API (which is currently not its purpose!). One solution would also be to set up our own harvesting server, but that would limit the capabilities (metadata fields) to those supplied by OAI-PMH and create quite a big overhead.