The-Academic-Observatory / academic-observatory-workflows

Telescopes, Workflows and Data Services for the Academic Observatory
https://academic-observatory-workflows.readthedocs.io
Apache License 2.0

Telescope workflow implementation: OpenAire #15

Open aroelo opened 4 years ago

aroelo commented 4 years ago

There are three ways to bulk access the OpenAIRE data:

OpenAIRE Research Graph Dumps

These can be downloaded from Zenodo (https://zenodo.org/search?page=1&size=20&q=OpenAIRE%20Research%20Graph%20Dump) or explored through their beta portal. There is one dump available from 18-12-2019 and one from 03-11-2020; the latter also has an updated JSON schema.

Each publication on Zenodo contains several dumps/files, and the 2019 set differs slightly from the 2020 set.

2019 files:

publication.gz: metadata records about research literature (includes types of publications listed here)
dataset.gz: metadata records about research data (includes the subtypes listed here)
software.gz: metadata records about research software (includes the subtypes listed here)
orp.gz: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)
organization.gz: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.
datasource.gz: metadata records about providers whose content is available in the OpenAIRE Research Graph. They include institutional and thematic repositories, journals, aggregators, funders' databases.
project.gz: metadata records about projects funded by a given funder.
<funder>_result.gz: metadata records about research results (publications, datasets, software, and other research products) funded by a given funder.

2020 files:

publication_[part].tar: metadata records about research literature (includes types of publications listed here)
dataset.tar: metadata records about research data (includes the subtypes listed here)
software.tar: metadata records about research software (includes the subtypes listed here)
otherresearchproduct.tar: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)
organization.tar: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.
datasource.tar: metadata records about providers whose content is available in the OpenAIRE Research Graph. They include institutional and thematic repositories, journals, aggregators, funders' databases.
project.tar: metadata records about projects funded by a given funder.
relation_[part].tar: metadata records about relations between entities in the graph
communities_infrastructures.tar: metadata records about research communities and research infrastructures

The image at https://doi.org/10.5281/zenodo.4238939 helps in understanding the relationship between these files.
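Since some of the 2020 archives are split into parts (`publication_[part].tar`, `relation_[part].tar`), one way to line the files up with separate tables is to group them by entity name. This is only a rough sketch: `group_by_entity` is a hypothetical helper, and it assumes the `[part]` suffix is a plain number, which the listing above does not confirm.

```python
import re
from collections import defaultdict

# Assumption: a part suffix looks like "_1", "_2", ... before the extension.
_DUMP_NAME = re.compile(r"^(?P<entity>.+?)(?:_(?P<part>\d+))?\.(?:tar|gz)$")


def group_by_entity(file_names):
    """Group dump archive names by entity, e.g. 'publication' or 'relation'."""
    groups = defaultdict(list)
    for name in file_names:
        match = _DUMP_NAME.match(name)
        if match is None:
            continue  # not a .tar/.gz dump archive, skip it
        groups[match.group("entity")].append(name)
    return {entity: sorted(names) for entity, names in groups.items()}


if __name__ == "__main__":
    example = ["publication_1.tar", "publication_2.tar", "dataset.tar", "relation_1.tar"]
    for entity, names in group_by_entity(example).items():
        print(entity, names)
```

Each key could then map to one table, with all parts of an entity loaded into the same table.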

OAI-PMH

The OAI-PMH harvester is available as well, with one note:

Currently the OAI-PMH publisher is not supporting incremental harvesting. Although the usage of the OAI parameters 'from' and 'until' is handled by the OAI publisher, the datestamps of metadata records are updated about every week.

I'm not sure what they mean by 'the datestamps of metadata records are updated about every week'. Given the data size, it might be best to download the dumps initially instead of using the OAI-PMH harvester. Perhaps the harvester can be used to update the data regularly with newly added or edited records, but I'm skeptical, since they mention above that incremental harvesting is not supported.
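To check how the datestamps actually behave, a small probe could request a `from`/`until` window and look at the datestamps that come back. The sketch below only uses the standard OAI-PMH 2.0 arguments (`verb`, `metadataPrefix`, `from`, `until`, `resumptionToken`); the base URL is left as a parameter because I have not confirmed the OpenAIRE endpoint, and `oai_dc` as the metadata prefix is an assumption too.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     until_date=None, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is exclusive: no other arguments besides verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
        if until_date:
            params["until"] = until_date
    return f"{base_url}?{urlencode(params)}"


def parse_page(xml_text):
    """Return (record datestamps, next resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    datestamps = [
        el.text
        for el in root.iterfind(".//oai:record/oai:header/oai:datestamp", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return datestamps, token
```

If nearly every record in a one-week window carries a datestamp inside that window, regardless of when it really changed, that would confirm that `from`/`until` cannot be used for incremental updates.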

Bulk access to projects

The APIs offer custom access to metadata about projects funded by a selection of international funders for the DSpace and EPrints platforms. The currently supported funding streams and relative codes are:

FP7: The 7th Framework Programme funded by the European Commission
WT: Wellcome Trust funding programme
H2020: Horizon2020 Programme funded by the European Commission
FCT: The funding programme of Fundação para a Ciência e a Tecnologia, the national funding agency of Portugal
ARC: the funding programme of the Australian Research Council
NHMRC: the funding programme of the Australian National Health and Medical Research Council
SFI: Science Foundation Ireland
HRZZ: Croatian Science Foundation
MZOS: Ministry of Science, Education and Sports of the Republic of Croatia
MESTD: The Ministry of Education, Science and Technological Development of Serbia
NWO: The Netherlands Organisation for Scientific Research

I'm not sure if this is of interest to us. I think this project data is included in the Zenodo files as well, and this is just an easier alternative if you're interested in a specific project.

Questions:

rhosking commented 3 years ago

Given how recent the last Zenodo dump is, I think getting that would be a good start. It's always hard to know how often it will be updated though, if at all. A general example of downloading content from Zenodo will likely be useful, as there are other datasets also hosted there, which might become future telescopes. I'm guessing the OAI-PMH stuff is difficult to harvest sequentially due to lots of existing records always being updated, but it's hard to tell from their description.
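As a starting point for the general Zenodo case, something like the sketch below might work. It assumes the Zenodo REST API returns a record as JSON from `https://zenodo.org/api/records/<id>` with a `files` list whose entries have a `key` (file name) and a `links.self` download URL; that shape should be checked against a real response before relying on it. The function names are hypothetical.

```python
import json
import shutil
import urllib.request

# Assumption: record metadata is served as JSON from this endpoint.
ZENODO_RECORD_API = "https://zenodo.org/api/records/{record_id}"


def fetch_record(record_id):
    """Fetch the JSON metadata for one Zenodo record (needs network access)."""
    url = ZENODO_RECORD_API.format(record_id=record_id)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def list_record_files(record):
    """Return (file name, download URL) pairs from a record dict."""
    return [
        (entry["key"], entry["links"]["self"])
        for entry in record.get("files", [])
    ]


def download_file(url, destination):
    """Stream one file to disk without loading it fully into memory."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
```

For the 2020 dump this would be roughly `list_record_files(fetch_record(4238939))`, taking the record id from the DOI mentioned above, followed by a `download_file` call per entry.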

Separate tables will be fine. That resembles how MAG looks, so I can write some SQL for bringing the key bits together.

In terms of scheduling, that's a difficult one; it's almost an only_once run. There is value in just mapping out all the schemas, and potentially doing some parts manually so we can have an initial view into the data. However, if turning it into a full telescope isn't much further work, then having an example of getting content from Zenodo is helpful in its own right.
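For the schema mapping, a quick first pass could simply collect the top-level fields seen in a sample of records. This sketch assumes each extracted dump file holds one JSON object per line, which I haven't verified for the OpenAIRE dumps; `infer_fields` is a hypothetical helper, not an existing function in this repo.

```python
import json


def infer_fields(json_lines):
    """Map each top-level field name to the set of Python type names seen for it."""
    fields = {}
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for key, value in record.items():
            fields.setdefault(key, set()).add(type(value).__name__)
    return fields


if __name__ == "__main__":
    sample = ['{"id": "a", "title": "x"}', '{"id": "b", "title": null}']
    for name, types in sorted(infer_fields(sample).items()):
        print(name, sorted(types))
```

Fields that show up with more than one type, or with `NoneType`, are the ones to look at before writing a table schema.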