What we are currently doing to prepare our repository:
Now that is the stuff I want to get into our institutional dataverse. This is only about metadata! The data would reside at its original source.
@donsizemore You mentioned Python code in the chat. What does it do exactly?
@djbrooke @pdurbin Hey Guys! Would it make sense to break this down in some way? Or is an issue consolidation in progress/to be expected for this as well?
@RightInTwo to help us keep this on our radar I think you should consider creating a project for your installation at https://github.com/orgs/IQSS/projects . If you're interested, please let me know and I can add you to a "read only" group. Beware that this also means we can assign issues to you. :smile: For more context on boards for installations, please see https://scholar.harvard.edu/pdurbin/blog/2019/jupyter-notebooks-and-crazy-ideas-for-dataverse
Also, breaking down issues is almost always good. It makes them easier to estimate. :+1:
> @RightInTwo to help us keep this on our radar I think you should consider creating a project for your installation at https://github.com/orgs/IQSS/projects . If you're interested, please let me know and I can add you to a "read only" group. Beware that this also means we can assign issues to you. :smile:
Very nice. Sign me up! Don't think that your threat will stop me :laughing:
> For more context on boards for installations, please see https://scholar.harvard.edu/pdurbin/blog/2019/jupyter-notebooks-and-crazy-ideas-for-dataverse
I feel honored to be mentioned :D
> Also, breaking down issues is almost always good. It makes them easier to estimate. :+1:
I'd be glad to. It would be great if some other people with general interest in harvesting features joined on this issue to make it easier to smash it into digestible pieces and prioritize them. Maybe there are also already good solutions to (some of) the problems in existence... @pdurbin, could you help me out with some more of your community magic?
How about re-framing this as "Harvest metadata from a list of DOIs"?
> How about re-framing this as "Harvest metadata from a list of DOIs"?
Maybe. Maybe we should try to tell a user story. How's this?
"As a user, I'd like to collect datasets in Dataverse based on metadata available in DataCite. These datasets would behave somewhat like harvested datasets in that they are read only and would clearly indicate that they did not originate in Dataverse."
I worry that I'm not understanding the "why" though. Are you saying that the researchers need a tool to collect related datasets together and that Dataverse could be that tool? What do they do now? Do they just have a bunch of bookmarks in their browser?
> I worry that I'm not understanding the "why" though. Are you saying that the researchers need a tool to collect related datasets together and that Dataverse could be that tool? What do they do now? Do they just have a bunch of bookmarks in their browser?
We don't publish any data ourselves. Therefore, it is necessary to collect references (DOIs) from the diverse places where data has been published.
For example, our unit DD is responsible for Components 1 and 6 of the German Longitudinal Election Study. The data resides at GESIS, but we would like to reference it in our institutional repository (based on Dataverse). So we would like to add the following DOIs from that page to our catalogue and map (link) them to the dataverse of the unit DD and to those of individual researchers (if they want to). Also, we want to use that information to feed the CRIS.
https://doi.org/10.4232/1.13089 https://doi.org/10.4232/1.12722 https://doi.org/10.4232/1.12808 https://doi.org/10.4232/1.12809 https://doi.org/10.4232/1.13168 https://doi.org/10.4232/1.13137 https://doi.org/10.4232/1.13138 https://doi.org/10.4232/1.13139 https://doi.org/10.4232/1.12804 https://doi.org/10.4232/1.12805 https://doi.org/10.4232/1.12806 https://doi.org/10.4232/1.12443 https://doi.org/10.4232/1.12043 https://doi.org/10.4232/1.11443 https://doi.org/10.4232/1.11444
Metadata should be retrieved from the best source available:
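Whatever the source list ends up looking like, DOI content negotiation makes the retrieval itself easy to script. A minimal sketch (any of the DOIs above works):

```python
# Minimal sketch: resolve a DOI straight to DataCite JSON via content negotiation.
import requests

r = requests.get(
    "https://doi.org/10.4232/1.13089",
    headers={"Accept": "application/vnd.datacite.datacite+json"},
)
metadata = r.json()
print(metadata["titles"][0]["title"])
```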
@RightInTwo thanks, this is helping. For now could you use the "Related Datasets" field to collect those DOIs? I just tried this at https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/24U2VG and here's a screenshot:
The "Related Datasets" field is multivalued, which is nice, and it supports HTML, so I was able to link to the DOIs, but there isn't much structure to it. It all just goes in a single text area. What do you think? What does @jggautier think? π
For now I would just use the "Other ID" field, but it would be best to have the DOI in the actual "Dataset Persistent ID" field.
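If scripting the "Other ID" route, something like this should work (an untested sketch; the endpoint and field names follow the native API docs, while the PID and agency are placeholders):

```python
# Rough sketch: fill the "Other ID" compound field of an existing dataset
# via the edit-metadata endpoint of the Dataverse native API.
import requests

fields = {"fields": [{
    "typeName": "otherId",
    "multiple": True,
    "typeClass": "compound",
    "value": [{
        "otherIdAgency": {"typeName": "otherIdAgency", "multiple": False,
                          "typeClass": "primitive", "value": "GESIS"},
        "otherIdValue": {"typeName": "otherIdValue", "multiple": False,
                         "typeClass": "primitive", "value": "doi:10.4232/1.13089"},
    }],
}]}

requests.put(
    "https://demo.dataverse.org/api/datasets/:persistentId/editMetadata",
    params={"persistentId": "doi:10.70122/FK2/EXAMPLE", "replace": "true"},
    headers={"X-Dataverse-key": "{insert API key here}"},
    json=fields,
)
```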
We are currently collecting them in a database outside of Dataverse, but at some point it would be great to get them in there together with the metadata. Until we manage that, we don't really want to make our Dataverse public (not even within the institute).
(Improving on #5998 would be appreciated anyways...)
One solution would also be to set up our own harvesting server, but that would limit the capabilities (metadata fields) to those supplied by OAI-PMH.
I have no idea if this factoid is helpful or not but Dataverse can harvest its own native JSON format over OAI-PMH. This means that every single metadata field is available, even custom metadata blocks. (That's my understanding anyway.) The downside, of course, is that you'd have to implement our crazy native JSON format in the harvesting server you create. :smile:
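For example, something like this (an untested sketch against the demo installation's OAI endpoint) should pull records in the native format:

```python
# Untested sketch: list records from a Dataverse OAI-PMH endpoint using the
# native JSON format ("dataverse_json" is the metadataPrefix Dataverse exposes).
import requests
import xml.etree.ElementTree as ET

OAI_URL = "https://demo.dataverse.org/oai"  # any installation's OAI endpoint

resp = requests.get(OAI_URL, params={
    "verb": "ListRecords",
    "metadataPrefix": "dataverse_json",
})
root = ET.fromstring(resp.content)
ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
for record in root.findall(".//oai:record", ns):
    print(record.find(".//oai:identifier", ns).text)
```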
Wouldn't it be easier to implement a separate service for this?
I'm also thinking in the direction of maybe slicing up Dataverse a bit and moving the complete harvesting into a separate module. It could run on its own, offer easier scaling, and use the Dataverse API to load new stuff into the database. (Not a microservice, but a modulith.)
It could either use Quarkus/Spring (stay'n in Java) or Python (excellent pyDataverse) :wink:
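Just to make the idea concrete, the core loop of such a module could be as small as this (a sketch only, assuming pyDataverse's Api client; get_doi_metadata and to_dataverse_json are hypothetical helpers that would live in the new module):

```python
# Sketch of the modulith's core loop: pull DOI metadata from an external
# source and push it into Dataverse via the native API (pyDataverse).
# get_doi_metadata() and to_dataverse_json() are hypothetical helpers.
from pyDataverse.api import Api

api = Api('https://dataverse.example.edu', api_token='{insert API key here}')

for doi in ['10.4232/1.13089', '10.4232/1.12722']:
    md = get_doi_metadata(doi)             # e.g. from DataCite
    ds_json = to_dataverse_json(md)        # map to Dataverse's native JSON
    api.create_dataset('myunit', ds_json)  # load into the target dataverse
```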
@poikilotherm Hi Oliver, that is kind of what I'm building now, except that I don't use any of the libraries but rather try to build it, without constraints or validation, in JS with jQuery. Not because that is so great, but because the colleagues who will take over for me (my contract ends in March) don't have any programming background except for some jQuery runtime manipulation in the browser...
Now, that all could be totally different if... a) ...Dataverse supported and maintained that functionality. That would of course be a lot of additional work and I understand that it may (currently) be out of scope. b) ...we developed something together! :D I'm not deep into Python, but pyDataverse seems very promising. And if there was a community effort on this, I'd be glad to be part of it and dump all that messy JS for good. @poikilotherm I could do the grunt work if you do the code structure and the QA :D
> One solution would also be to set up our own harvesting server, but that would limit the capabilities (metadata fields) to those supplied by OAI-PMH.
>
> Dataverse can harvest its own native JSON format over OAI-PMH
Good point! Maybe a small OAI-PMH server could be part of the solution then.
@donsizemore We once had a chat about this topic - are you still interested?
@RightInTwo since you're using Javascript you should definitely check out the new kid on the block when it comes to Dataverse API client libraries: dataverse-client-javascript! :tada:
Developed primarily by @tainguyenbui it may be new but it's moving fast! And it's on npm.
> @RightInTwo since you're using Javascript you should definitely check out the new kid on the block when it comes to Dataverse API client libraries: dataverse-client-javascript! :tada:
Yes, I just discovered that yesterday! Let's see what the other people on here think about the choices regarding language and architecture. Maybe @skasberger could also contribute with his opinion?
Here is some code as an example of how a quick and dirty import of DataCite metadata via DDI-XML works. After some experiments with mapping from one Python dict to another, trying to create a dict in the Dataverse JSON format, I ended up with a solution that really earns the "quick and dirty" tag: just insert everything into a string in the DDI-XML format accepted by /datasets/:importddi.
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import json

from jsonpath_ng.ext import parse as parseJsonPath
from requests import get, post

# pyDataverse doesn't provide an API to import DDI-XML yet, so we just use requests
# from pyDataverse.api import Api
# from pyDataverse.models import Dataverse

# %%
######################################################
# Setup

## Define API URLs
apimethods = {
    'datacite_get_datacitejson': {
        'usage': ("get the datacite+json representation of the DOI metadata. "
                  "You need to append a DOI (just the ID!) to the URL."),
        'url': 'https://data.datacite.org/application/vnd.datacite.datacite+json/'
    },
    'datacite_get_xbibliography': {
        'usage': ("get the x-bibliography representation of the DOI metadata. "
                  "You need to append a DOI (just the ID!) to the URL."),
        'url': 'https://data.datacite.org/text/x-bibliography/'
    }
}

## Provide API key
apikey = {
    'wzbdataverse': '{insert API key here}'
}

## Provide base URL
baseurl = {
    'wzbdataverse': 'https://dataverse.wzb.eu'
}

# %%
######################################################
# QUICK AND DIRTY function to map from datacite+json (as a Python dict)
# to DDI-XML (as a string)
def ddiXmlFromDoi(doi):
    md = json.loads(get(apimethods['datacite_get_datacitejson']['url'] + doi).content)
    citation = get(apimethods['datacite_get_xbibliography']['url'] + doi).content.decode('utf-8')
    issueDate = parseJsonPath('$.dates[?dateType="Issued"].date').find(md)[0].value
    pubYear = parseJsonPath('$.publicationYear').find(md)[0].value
    title = parseJsonPath('$.titles[0].title').find(md)[0].value
    creators = parseJsonPath('$.creators').find(md)[0].value
    keywords = parseJsonPath('$.subjects').find(md)[0].value
    descriptions = parseJsonPath('$.descriptions').find(md)[0].value
    version = 1
    subTitle = ''
    # The DDI string was cut off in the original comment; the skeleton below is a
    # reconstruction covering only the basics (title, authors, keywords, abstract).
    authors = ''.join(f'<AuthEnty>{c["name"]}</AuthEnty>' for c in creators)
    subjects = ''.join(f'<keyword>{k["subject"]}</keyword>' for k in keywords)
    abstracts = ''.join(f'<abstract>{d["description"]}</abstract>' for d in descriptions)
    ddixml = f"""<?xml version="1.0" encoding="UTF-8"?>
<codeBook xmlns="ddi:codebook:2_5" version="2.5">
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl>{title}</titl>
        <subTitl>{subTitle}</subTitl>
        <IDNo agency="DOI">{doi}</IDNo>
      </titlStmt>
      <rspStmt>{authors}</rspStmt>
      <verStmt><version date="{issueDate}">{version}</version></verStmt>
      <distStmt><distDate date="{issueDate}">{pubYear}</distDate></distStmt>
      <biblCit>{citation}</biblCit>
    </citation>
    <stdyInfo>
      <subject>{subjects}</subject>
      {abstracts}
    </stdyInfo>
  </stdyDscr>
</codeBook>"""
    return ddixml
```
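Pushing the result into a dataverse then looks roughly like this (a sketch; "mydataverse" is a placeholder alias for the target collection):

```python
# Sketch: import the generated DDI into a dataverse collection via
# /api/dataverses/$ALIAS/datasets/:importddi.
doi = '10.4232/1.13089'
resp = post(
    baseurl['wzbdataverse'] + '/api/dataverses/mydataverse/datasets/:importddi',
    params={'pid': 'doi:' + doi, 'release': 'no'},
    headers={'X-Dataverse-key': apikey['wzbdataverse']},
    data=ddiXmlFromDoi(doi).encode('utf-8'),
)
print(resp.status_code, resp.json())
```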
With a custom OAI-PMH server (which holds the metadata for a specified list of DOIs, also see #6425), the solution could be achieved with a harvesting client in Dataverse. Steps 1 & 2 would be run regularly (daily?).
[Diagram of the proposed setup (green: exists, red: todo)]
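To make step 1 more concrete, here is a very rough (untested) sketch of such a server, reusing ddiXmlFromDoi from the snippet above; a real implementation would also need Identify, ListMetadataFormats, datestamps, and resumption tokens:

```python
# Very rough sketch of the idea: a tiny OAI-PMH endpoint serving DDI records
# for a fixed list of DOIs, so a Dataverse harvesting client can pull them.
# Only ListIdentifiers and GetRecord are hinted at here.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

DOIS = ['10.4232/1.13089', '10.4232/1.12722']  # the DOIs to expose

class OaiHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        qs = parse_qs(urlparse(self.path).query)
        verb = qs.get('verb', [''])[0]
        if verb == 'ListIdentifiers':
            headers = ''.join(
                f'<header><identifier>doi:{d}</identifier></header>' for d in DOIS)
            body = f'<ListIdentifiers>{headers}</ListIdentifiers>'
        elif verb == 'GetRecord':
            doi = qs['identifier'][0].replace('doi:', '', 1)
            ddi = ddiXmlFromDoi(doi).split('?>', 1)[-1]  # drop the XML declaration
            body = f'<GetRecord><record><metadata>{ddi}</metadata></record></GetRecord>'
        else:
            body = '<error code="badVerb">not implemented in this sketch</error>'
        xml = ('<?xml version="1.0" encoding="UTF-8"?>'
               '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
               f'{body}</OAI-PMH>')
        self.send_response(200)
        self.send_header('Content-Type', 'text/xml')
        self.end_headers()
        self.wfile.write(xml.encode('utf-8'))

HTTPServer(('', 8080), OaiHandler).serve_forever()
```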
@tcoupin @pdurbin @djbrooke Since this is not going to be a core feature, where should this project reside and under what name? In the IQSS github, named something like "doi2pmh-server"?
@RightInTwo I could create an empty repo for you if you want. You'd want to mention prominently in the README that it's community supported. Nice diagram! (And nice code earlier. :smile:)
> @RightInTwo I could create an empty repo for you if you want.
@tcoupin Would you agree to administering this? Since my contract ends in the end of March, I cannot commit to that, but will gladly take part in developing it until then.
Yes :+1:
@RightInTwo @tcoupin ok I just created https://github.com/IQSS/doi2pmh-server and made you admins of it. Again, please make sure you indicate that this is a community supported project. Have fun you two. :smile:
Thanks @pdurbin for setting up that repo.
@tcoupin @RightInTwo Thanks for working on this. I think a solution that allows institutions to easily set up their collections in an OAI-PMH server and then have the metadata reflected in Dataverse for discoverability purposes is great.
@pdurbin @djbrooke Thanks for making this happen!
@poikilotherm @tcoupin See you on the other side!
See https://github.com/IQSS/doi2pmh-server for the continuation!
I would like to harvest heterogeneous sources that don't necessarily present the datasets I need through OAI-PMH or in the form I need them. The issues I see with OAI-PMH:
These datasets would be described and updated using the metadata for the DOIs supplied by DataCite and Crossref through the import API (which is currently not its purpose!). One solution would also be to set up our own harvesting server, but that would limit the capabilities (metadata fields) to those supplied by OAI-PMH and create quite a big overhead.