Open aroelo opened 4 years ago
Given how recent the last Zenodo dump was, I think getting that would be a good start. It's always hard to know how often it will be updated though, if at all. A general example of downloading content from Zenodo will likely be useful, as there are other datasets also hosted there, which might become future telescopes. I'm guessing the OAI-PMH stuff is difficult to harvest sequentially because lots of existing records are always being updated, but it's hard to tell from their description.
Separate tables will be fine; that resembles how MAG looks, so I can write some SQL to bring the key bits together.
In terms of scheduling, that's a difficult one; it's almost an only_once run. There is value in just mapping out all the schemas, and potentially doing some parts manually so we can have an initial view into the data. However, if turning it into a full telescope isn't much further work, then having an example of getting content from Zenodo is helpful in its own right.
There are 3 ways of bulk accessing the OpenAIRE data:
OpenAIRE Research Graph Dumps
These can be downloaded from Zenodo (https://zenodo.org/search?page=1&size=20&q=OpenAIRE%20Research%20Graph%20Dump) or explored through their beta portal. There is one dump available from 18-12-2019 and one from 03-11-2020, which also has an updated JSON schema.
Each publication on Zenodo contains several dumps/files; the 2019 set is slightly different from the 2020 set. 2019 files:
2020 files:
This image from https://doi.org/10.5281/zenodo.4238939 helps to understand the relationship between these files.
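As a rough sketch of how the files attached to a Zenodo record could be listed and fetched: the code below assumes the Zenodo REST endpoint `https://zenodo.org/api/records/<id>` returns JSON with a `files` list whose entries carry `key`, `links.self`, `size` and a `checksum` of the form `md5:<hex>`. That response shape, the record ID `4238939` (taken from the DOI above) and all function names are assumptions for illustration, not something confirmed in this thread.

```python
from __future__ import annotations

import hashlib
import json
import urllib.request

# Assumed Zenodo REST endpoint for a single record.
ZENODO_API = "https://zenodo.org/api/records/{record_id}"


def fetch_record(record_id: str) -> dict:
    """Fetch the JSON metadata of one Zenodo record."""
    with urllib.request.urlopen(ZENODO_API.format(record_id=record_id)) as resp:
        return json.load(resp)


def list_record_files(record: dict) -> list[dict]:
    """Extract name, URL, size and md5 for each file in a record dict.

    Assumes the response shape described in the lead-in.
    """
    files = []
    for entry in record.get("files", []):
        checksum = entry.get("checksum", "")
        files.append({
            "name": entry["key"],
            "url": entry["links"]["self"],
            "size": entry.get("size"),
            "md5": checksum.split(":", 1)[1] if checksum.startswith("md5:") else None,
        })
    return files


def download_file(url: str, dest: str, expected_md5: str | None = None,
                  chunk_size: int = 1 << 20) -> str:
    """Stream a file to disk in chunks and return its md5 hex digest.

    Raises ValueError if expected_md5 is given and does not match.
    """
    md5 = hashlib.md5()
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        while chunk := resp.read(chunk_size):
            out.write(chunk)
            md5.update(chunk)
    digest = md5.hexdigest()
    if expected_md5 and digest != expected_md5:
        raise ValueError(f"md5 mismatch for {dest}: {digest} != {expected_md5}")
    return digest


if __name__ == "__main__":
    # Only list the files here; the dumps are large, so downloading is left explicit.
    for f in list_record_files(fetch_record("4238939")):
        print(f["name"], f["size"], f["url"])
```

Streaming in chunks keeps memory use flat regardless of dump size, and checking the md5 catches truncated downloads before any further processing.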
OAI-PMH
The OAI-PMH harvester is available as well, with one note:
I'm not sure what they mean by 'the datestamps of metadata record are updated about every week'. Considering the data size, it might be best to initially download the dumps instead of using the OAI-PMH harvester. Perhaps the harvester can be used to update the data regularly with newly added/edited records, but I'm skeptical since they mention above that incremental harvesting is not supported.
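For reference, a full (non-incremental) OAI-PMH harvest boils down to calling `ListRecords` and following the `resumptionToken` until it comes back empty. The sketch below shows that loop using only the standard library. The base URL `http://api.openaire.eu/oai_pmh` and the `oai_dc` metadata prefix are assumptions that should be checked against the OpenAIRE documentation, and the function names are made up for illustration.

```python
from __future__ import annotations

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
BASE_URL = "http://api.openaire.eu/oai_pmh"  # assumed endpoint, verify before use


def build_url(base_url: str, metadata_prefix: str = "oai_dc",
              resumption_token: str | None = None) -> str:
    """Build a ListRecords request URL.

    Per the OAI-PMH spec, resumptionToken is an exclusive argument:
    when it is sent, metadataPrefix must be omitted.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urllib.parse.urlencode(params)}"


def parse_page(xml_text: str | bytes) -> tuple[list[str], str | None]:
    """Return the record identifiers on one page and the next token, if any."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text for el in
        root.iterfind(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    has_token = token_el is not None and token_el.text and token_el.text.strip()
    return identifiers, token_el.text.strip() if has_token else None


def harvest(base_url: str = BASE_URL, metadata_prefix: str = "oai_dc",
            max_pages: int | None = None):
    """Yield record identifiers page by page until the token runs out."""
    token = None
    pages = 0
    while True:
        url = build_url(base_url, metadata_prefix, token)
        with urllib.request.urlopen(url) as resp:
            identifiers, token = parse_page(resp.read())
        yield from identifiers
        pages += 1
        if not token or (max_pages is not None and pages >= max_pages):
            break
```

Something like `for identifier in harvest(max_pages=1): print(identifier)` would be enough to see how large a page is and what the identifiers look like before committing to a full harvest.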
Bulk access to projects
The APIs offer custom access to metadata about projects funded by a selection of international funders for the DSpace and EPrints platforms. The currently supported funding streams and relative codes are:
I'm not sure if this is of interest to us. I think this project data is included in the Zenodo files as well and this is just an alternative easy way if you're interested in a specific project.
Questions: