RDA-DMP-Common / hackathon-2020

RDA hackathon on maDMPs
The Unlicense

Export/Import maDMP from Figshare #17

Open peterneish opened 4 years ago

peterneish commented 4 years ago

First step: extract an maDMP from a Figshare repository using the Figshare API: https://docs.figshare.com/

Second step: test importing an maDMP into Figshare
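A minimal sketch of the first step in Python (the Figshare field names follow the public articles API docs; the maDMP field names follow the RDA DMP Common Standard 1.0; both mappings are untested assumptions to verify against the schema):

```python
def figshare_article_to_dataset(article: dict) -> dict:
    """Map one Figshare article record (as returned by
    https://api.figshare.com/v2/articles/{id}) to an RDA maDMP 'dataset' entry."""
    return {
        "title": article.get("title", ""),
        "description": article.get("description", ""),
        # Figshare exposes the article DOI; maDMP expects a typed identifier
        "dataset_id": {"identifier": article.get("doi", ""), "type": "doi"},
        # one maDMP distribution per Figshare file
        "distribution": [
            {
                "title": f.get("name", ""),
                "byte_size": f.get("size", 0),
                "access_url": f.get("download_url", ""),
            }
            for f in article.get("files", [])
        ],
    }

# Example with a trimmed-down article record:
article = {
    "title": "Survey data 2020",
    "description": "Raw survey responses",
    "doi": "10.5281/example.1234",
    "files": [
        {"name": "survey.csv", "size": 2048,
         "download_url": "https://ndownloader.figshare.com/files/1"},
    ],
}
dataset = figshare_article_to_dataset(article)
```

In practice the `article` dict would come from a GET request to the articles endpoint; the shape above is trimmed to the fields the mapping touches.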

jomtov commented 4 years ago

Hi, I'd be interested to participate in this issue. At Stockholm University (SU) we have an instance, su.figshare.com, from which we harvest items and transform them to the Swedish archival standard (FGS-CSPackage), extracting Figshare file metadata and data files via the base API https://api.figshare.com/v2/articles/ as a complement to the OAI-PMH feeds in METS (https://api.figshare.com/v2/oai?verb=ListRecords&metadataPrefix=mets), plus some other metadata sources (e.g. local staff catalog, ORCID).

We also use DMP Online, and our su.figshare.com web form has a custom metadata field for a reference to a DMP (not mandatory, and not much used yet), so it would be great if we could extract maDMPs from su.figshare.com and then import them into DMP Online (which might be the task of another issue here).

Unfortunately, we have a Carpentries workshop on the very same days as the hackathon (27 May pm - 28 May am), so I don't know to what extent I could participate and contribute.

peterneish commented 4 years ago

Sounds interesting and aligned to our setup. We also have DMP Online, so would be interested in working on this integration.

jomtov commented 4 years ago

Should add first that our present harvest and transform from su.figshare.com via the base API essentially uses XML techniques such as XQuery (in BaseX) and XSLT, so the JSON output from the Figshare API is read and converted to XML in the XQuery script via three variables:

```xquery
let $url := concat("https://api.figshare.com/v2/articles/", $u)
let $jsonMD := html:parse(unparsed-text($url))
let $json2xml := json:parse($jsonMD)
```

This is only to say that transforming directly from JSON to JSON will be a new challenge for me, one I have no previous experience with.

I would also like to clarify (since I had this question from local colleagues) that we are talking about the same thing here: the first step above, extracting maDMPs from Figshare by means of the API, would mean actually creating maDMP JSON output from datasets in Figshare (as has been attempted with Dataverse, see https://github.com/oblassers/dmap/issues/1 and https://hido1994.github.io/madmp/), not extracting already existing DMPs in Figshare; at least in our instance, su.figshare.com, there are none to my knowledge.

As for the second part, the possibility of importing the output from step 1 into DMP Online, we should work with, or at least keep a close eye on, the work in issue #2 and the DMP Exchange Team, @briri, @xsrust, @sjDCC et al.
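To make "creating maDMP JSON output" concrete, a minimal envelope around such dataset entries could look like the sketch below. The identifiers are placeholders and the set of required top-level fields is an assumption; the result should be checked against maDMP-schema-1.0.json:

```python
from datetime import datetime, timezone

def wrap_as_madmp(datasets: list, title: str,
                  contact_name: str, contact_mbox: str) -> dict:
    """Wrap maDMP dataset entries in a minimal DMP document
    (field names per the RDA DMP Common Standard 1.0, unverified)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    return {
        "dmp": {
            "title": title,
            "created": now,
            "modified": now,
            "language": "eng",
            "ethical_issues_exist": "unknown",
            # placeholder identifier; a real export would mint or reuse one
            "dmp_id": {"identifier": "https://example.org/dmp/1", "type": "url"},
            "contact": {
                "name": contact_name,
                "mbox": contact_mbox,
                # placeholder ORCID iD
                "contact_id": {
                    "identifier": "https://orcid.org/0000-0000-0000-0000",
                    "type": "orcid",
                },
            },
            "dataset": datasets,
        }
    }

madmp = wrap_as_madmp([], "Survey data 2020 DMP", "Jane Doe", "jane@example.org")
```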

peterneish commented 4 years ago

Yes, that is right - no existing DMPs in our Figshare (that I know about), would be creating maDMPs from Figshare datasets.

jomtov commented 4 years ago

Good, that's settled then!

briri commented 4 years ago

This is an interesting idea. Is the long term goal to monitor Figshare and collect research outputs and connect them back to your DMPs so that you can do things like compliance checking?

jomtov commented 4 years ago

Yes! And, as an extension, other repositories like Dataverse and Zenodo. The goal is to facilitate compliance checking not only against the RDA DMP Common Standard, but also against the requirements of local templates/answers and data policies. That is also partly why I stick to "old" XML techniques like Schematron for validation, which allow for tailor-made, phased diagnostics and differential validation according to the choice of template etc.

briri commented 4 years ago

That's great and a use case we have identified as well.

You should be able to validate your JSON against the RDA common standard schema (in theory!). I have not tried it yet, but it is something I hope to do in the near future. Here's a doc on it: https://json-schema.org/implementations.html. I think most languages have a plugin/library for it.
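With Python's third-party `jsonschema` package this kind of check is a few lines. The schema below is a toy stand-in for maDMP-schema-1.0.json (in practice you would load the real file), with only a couple of required fields shown:

```python
from jsonschema import Draft7Validator

# Toy stand-in for maDMP-schema-1.0.json; load the real schema in practice.
schema = {
    "type": "object",
    "required": ["dmp"],
    "properties": {
        "dmp": {
            "type": "object",
            "required": ["title", "created", "modified", "dmp_id"],
        }
    },
}

candidate = {"dmp": {"title": "My plan", "created": "2020-05-27T00:00:00"}}

# iter_errors collects every violation instead of stopping at the first
errors = [e.message for e in Draft7Validator(schema).iter_errors(candidate)]
# errors reports the missing 'modified' and 'dmp_id' fields
```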

I did something similar with a prototype script that scans the National Science Foundation (NSF) awards API to find grant ids/urls associated with our DMPs (NSF has specific DMP templates in our system, so we are able to narrow down which DMPs are likely to appear in their awards data). Unfortunately it relies on name/title matching, which assumes the user named their DMP closely to how they titled their grant proposal 🙄. The title comparison algorithm is a bit naive, but we did have some success when checking against all of our historical DMPs. There is some danger of false positives, so some sort of curator intervention would likely be needed if we ever implement it.
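One naive-but-serviceable approach of the kind described, sketched with Python's standard-library `difflib` (the titles and the 0.45 threshold are made up for illustration):

```python
from difflib import SequenceMatcher

def title_similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] after lowercasing and collapsing whitespace."""
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

dmp_title = "Data Management Plan for Coral Reef Imaging"
award_titles = [
    "Coral Reef Imaging and Analysis",
    "Quantum Computing Outreach",
]

# flag candidate matches above a threshold for curator review
candidates = [t for t in award_titles
              if title_similarity(dmp_title, t) > 0.45]
```

False positives are exactly the risk here, which is why the matches would go to a curator queue rather than being linked automatically.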

jomtov commented 4 years ago

I did validate JSON output examples from both DMP Online and Data Stewardship Wizard 2.1 against the RDA common standard schema (maDMP-schema-1.0.json), as I reported in maDMP.slack.com/ #machine-actionable-dmps, where I also noted: "The present RDA maDMP-schema will not (yet) serve as an efficient tool for (self-)evaluation and review of DMPs, since it leaves out and does not validate the largest, and perhaps most important, part of a DMP: the answers to the questions asked in the plan. To develop a validation schema also for these parts of a DMP, however, would further require much more standardisation of possible answers (e.g. through enumeration lists, data typing etc.) in the DMP templates of various tools."

But thanks for the link to JSON Schema implementations; it might provide some other useful tools for this work (I am already using the Oxygen JSON editor listed there).

Interesting to learn about your name/title matching to find grant ids/urls; it seems you are doing better than we are with some personal-name searches to get the ORCID iDs of our su.figshare depositors, at least for those who have failed to connect their ORCID iD to their su.figshare account.

Ebrahim1010 commented 4 years ago

@peterneish thanks for sharing this issue with me. We will be attempting this integration as well