CDLUC3 / dmsp_aws_prototype

Sceptre CloudFormation templates for DMPHub v2
MIT License

Investigate APIs for various repositories #58

Closed briri closed 3 months ago

briri commented 1 year ago

Here’s the list. We don’t need it to be exhaustive; just some examples would be helpful.

Particularly thinking about:

APIs:

briri commented 1 year ago

Mendeley analysis:

Mendeley includes ORCIDs and RORs 🎉

The API does not appear to support filtering or searching by ORCID or ROR, though; it only allows searching by internal Mendeley IDs (institution and profile).

The API also only supports discovery of public datasets. How would we discover metadata about private outputs?

The API uses OAuth2 (client_credentials here) so you must create an account with Elsevier and then add your application in the dev tools section.

# AUTH
# ---------------------------------
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -u [id]:[secret] -d "grant_type=client_credentials&scope=all" https://api.mendeley.com/oauth/token

# SEARCH
# ---------------------------------
curl -H 'Authorization: Bearer [token]' "https://api.mendeley.com/datasets?type=software&limit=2"

# Results
# ---------------------------------
{
  "results": [
    {
      "id":"000000000",
      "doi":{
        "id":"10.12345/000000000.3",
        "status":"allocated",
        "prefix":"10.12345"
      },
      "name":"Example name of the output",
      "description":"Description of the output that deals with Paleomagnetism",
      "version":3,
      "contributors":[
        {
          "profile_id":"12345",
          "first_name":"Mickey",
          "last_name":"Mouse"
        },{
          "profile_id":"123456",
          "first_name":"Donald",
          "last_name":"Duck",
          "orcid_id":"0000-0000-0000-0000"
        },{
          "first_name":"Minnie",
          "last_name":"Mouse"
        }
      ],
      "versions":[
        {
          "version":3,
          "available":true,
          "publish_date":"2022-06-27T15:43:33.294Z"
        },{
          "version":2,
          "available":true,
          "publish_date":"2022-05-30T07:01:56.404Z"
        },{
          "version":1,
          "available":true,
          "publish_date":"2021-12-23T15:21:20.402Z"
        }
      ],
      "articles":[],
      "categories":[
        {
          "id":"data.elsevier.com/vocabulary/OmniScience/Concept-210636270",
          "label":"Paleomagnetism"
        }
      ],
      "institutions":[
        {
          "id":"999999999",
          "name":"Universidad Nacional Autonoma de Mexico"
        },{
          "id":"88888888888",
          "name":"Universidad de Sonora",
          "ror_id":"https://ror.org/00c32gy34"
        }
      ],
      "available":true,
      "size":0,
      "owner":{
        "profile_id":"12345",
        "first_name":"Mickey",
        "last_name":"Mouse"
      },
      "channel":"WEB",
      "owner_id":"12345",
      "publish_date":"2022-06-27T15:43:33.294Z",
      "data_licence":{
        "id":"01d9c749-3c4d-4431-9df3-620b2dcfe144",
        "description":"You can share, copy and modify this dataset so long as you give appropriate credit, provide a link to the CC BY license, and indicate if changes were made, but you may not do so in a way that suggests the rights holder has endorsed you or your use of the dataset. Note that further permission may be required for any content within the dataset that is identified as belonging to a third party.",
        "url":"http://creativecommons.org/licenses/by/4.0",
        "category":"Creative",
        "short_name":"CC BY 4.0",
        "full_name":"Creative Commons Attribution 4.0 International"
      },
      "related_links":[],
      "funders":[],
      "customer_id":"555555555",
      "modified_on":"2022-06-26T01:28:14.149Z",
      "confidential":false,
      "links":{
        "view":"https://data.mendeley.com/datasets/000000000"
      },
      "repository":{
        "id":"MENDELEY_DATA",
        "name":"Mendeley Data"
      }
    }
  ]
}
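
Since the API cannot filter by ORCID or ROR, one option would be to fetch results and match on our side. Here is a minimal Python sketch of that idea, assuming the response shape shown above. The helper names `mendeley_identifiers` and `filter_by_identifier` are made up for illustration and are not part of the Mendeley API.

```python
def mendeley_identifiers(dataset: dict) -> tuple:
    """Collect the ORCIDs and RORs found on one item of the `results` array."""
    orcids = {
        c["orcid_id"] for c in dataset.get("contributors", []) if c.get("orcid_id")
    }
    rors = {
        i["ror_id"] for i in dataset.get("institutions", []) if i.get("ror_id")
    }
    return orcids, rors


def filter_by_identifier(results: list, orcid: str = None, ror: str = None) -> list:
    """Keep datasets that mention the given ORCID or ROR (client-side filter)."""
    matched = []
    for dataset in results:
        orcids, rors = mendeley_identifiers(dataset)
        if (orcid and orcid in orcids) or (ror and ror in rors):
            matched.append(dataset)
    return matched


# Trimmed copy of the sample result above
sample_results = [
    {
        "id": "000000000",
        "contributors": [
            {"profile_id": "12345", "first_name": "Mickey", "last_name": "Mouse"},
            {
                "profile_id": "123456",
                "first_name": "Donald",
                "last_name": "Duck",
                "orcid_id": "0000-0000-0000-0000",
            },
            {"first_name": "Minnie", "last_name": "Mouse"},
        ],
        "institutions": [
            {"id": "999999999", "name": "Universidad Nacional Autonoma de Mexico"},
            {
                "id": "88888888888",
                "name": "Universidad de Sonora",
                "ror_id": "https://ror.org/00c32gy34",
            },
        ],
    }
]

matches = filter_by_identifier(sample_results, ror="https://ror.org/00c32gy34")
print([d["id"] for d in matches])
```

This only works on whatever pages we have already fetched, so it does not solve discovery; it just shows that the identifiers are easy to pick out once we have the records.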
briri commented 1 year ago

OSF is a bit opaque. We need to review the API docs in more detail since it uses a lot of graphDB language like GET /nodes/.

It's possible, though, to have multiple entry points: for example, searching for preprints by title and then walking to the list of contributors, OR starting with a search for a contributor and then walking their preprints and nodes. Either way it will require multiple API calls and some of our own algorithms to verify matches.

As with Mendeley, it would be super useful to be able to search by ROR or ORCID, although I am not sure whether their metadata contains those identifiers.

We need to use Postman or something similar, since you must obtain an OAuth2 authorization code before fetching a token. I didn't see a way to just use client_credentials.

Example of result from the preprints API: https://api.test.osf.io/v2/preprints. The available filters do not allow searching for a contributor name or institution. We would need to fetch results and then do our own filtering.

{
  "data": [
      {
          "id": "12345",
          "type": "preprints",
          "attributes": {
              "date_created": "2023-09-22T15:35:53.459096",
              "date_modified": "2023-09-22T15:37:17.496732",
              "date_published": "2023-09-22T15:37:16.381453",
              "original_publication_date": null,
              "doi": null,
              "title": "Test edit",
              "description": "Lorem ipsum dolor sit amet, consectetur adipiscing elit",
              "is_published": true,
              "is_preprint_orphan": false,
              "license_record": {
                  "copyright_holders": [
                      ""
                  ],
                  "year": "2023"
              },
              "tags": [],
              "preprint_doi_created": null,
              "date_withdrawn": null,
              "current_user_permissions": [],
              "public": true,
              "reviews_state": "pending",
              "date_last_transitioned": "2023-09-22T15:37:16.381453",
              "has_coi": false,
              "conflict_of_interest_statement": null,
              "has_data_links": "no",
              "why_no_data": null,
              "data_links": [],
              "has_prereg_links": "no",
              "why_no_prereg": null,
              "prereg_links": [],
              "prereg_link_info": "",
              "subjects": [
                  [
                      {
                          "id": "59552881da3e240081ba3203",
                          "text": "Arts and Humanities"
                      }
                  ]
              ]
          },
          "relationships": {
              "contributors": {
                  "links": {
                      "related": {
                          "href": "https://api.test.osf.io/v2/preprints/12345/contributors/",
                          "meta": {}
                      }
                  }
              },
              "bibliographic_contributors": {
                  "links": {
                      "related": {
                          "href": "https://api.test.osf.io/v2/preprints/12345/bibliographic_contributors/",
                          "meta": {}
                      }
                  }
              },
              "citation": {
                  "links": {
                      "related": {
                          "href": "https://api.test.osf.io/v2/preprints/12345/citation/",
                          "meta": {}
                      }
                  },
                  "data": {
                      "id": "8n27h",
                      "type": "preprints"
                  }
              },
              "identifiers": {
                  "links": {
                      "related": {
                          "href": "https://api.test.osf.io/v2/preprints/12345/identifiers/",
                          "meta": {}
                      }
                  }
              },
              "node": {
                  "links": {
                      "self": {
                          "href": "https://api.test.osf.io/v2/preprints/12345/relationships/node/",
                          "meta": {}
                      }
                  }
              },
              "license": {
                  "links": {
                      "related": {
                          "href": "https://api.test.osf.io/v2/licenses/000000000000000000/",
                          "meta": {}
                      }
                  },
                  "data": {
                      "id": "000000000000000000",
                      "type": "licenses"
                  }
              },
              "provider": {
                  "links": {
                      "related": {
                          "href": "https://api.test.osf.io/v2/providers/preprints/osf/",
                          "meta": {}
                      }
                  },
                  "data": {
                      "id": "osf",
                      "type": "preprint-providers"
                  }
              },
              "files": {
                  "links": {
                      "related": {
                          "href": "https://api.test.osf.io/v2/preprints/12345/files/",
                          "meta": {}
                      }
                  }
              },
              "primary_file": {
                  "links": {
                      "related": {
                          "href": "https://api.test.osf.io/v2/files/000000000000000000/",
                          "meta": {}
                      }
                  },
                  "data": {
                      "id": "650db45d3cbde5000ad3eca2",
                      "type": "files"
                  }
              },
              "review_actions": {
                  "links": {
                      "related": {
                          "href": "https://api.test.osf.io/v2/preprints/12345/review_actions/",
                          "meta": {}
                      }
                  }
              },
              "requests": {
                  "links": {
                      "related": {
                          "href": "https://api.test.osf.io/v2/preprints/12345/requests/",
                          "meta": {}
                      }
                  }
              }
          },
          "links": {
              "self": "https://api.test.osf.io/v2/preprints/12345/",
              "html": "https://test.osf.io/12345/",
              "preprint_doi": "https://doi.org/10.12345/ABC123.io/12345"
          }
      }
    ]
  }
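
To make the "search preprints, then walk to contributors" path concrete, here is a rough Python sketch that pulls the contributors URL for each preprint out of a page shaped like the sample above. It only covers the first hop. The function name `contributor_links` is hypothetical, and I have not checked what the contributors endpoint returns, so the follow-up calls and the matching logic are left out.

```python
def contributor_links(preprints_page: dict) -> dict:
    """Map each preprint id to the URL of its contributors list."""
    links = {}
    for item in preprints_page.get("data", []):
        related = (
            item.get("relationships", {})
            .get("contributors", {})
            .get("links", {})
            .get("related", {})
        )
        href = related.get("href")
        if href:
            links[item["id"]] = href
    return links


# Trimmed copy of the sample page above
sample_page = {
    "data": [
        {
            "id": "12345",
            "type": "preprints",
            "attributes": {"title": "Test edit"},
            "relationships": {
                "contributors": {
                    "links": {
                        "related": {
                            "href": "https://api.test.osf.io/v2/preprints/12345/contributors/",
                            "meta": {},
                        }
                    }
                }
            },
        }
    ]
}

for preprint_id, url in contributor_links(sample_page).items():
    # Each URL would need its own authenticated GET request.
    print(preprint_id, url)
```

One extra request per preprint adds up quickly, which is part of why the multiple-calls concern above matters.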
briri commented 1 year ago

The Dataverse Search API allows access to published datasets.

Dataverse is an open source codebase and there are many installations out in the wild (e.g. Harvard), so we will likely need a table to store the target URLs and the searchable fields (see this issue discussing that).

I am not seeing ORCID or ROR identifiers in the output, so we would need to do some messy text matching on names.

# EXAMPLE QUERIES:
# -------------------------------------
curl "https://demo.dataverse.org/api/search?q=*&type=dataset"
curl "https://demo.dataverse.org/api/search?q=trees"

# Example result
# -------------------------------------
{
  "status":"OK",
  "data":{
    "q":"*",
    "total_count":2792,
    "start":0,
    "spelling_alternatives":{},
    "items":[
      {
        "name":"test dataset #2",
        "type":"dataset",
        "url":"https://doi.org/10.12345/ABC/ZYXWVUT",
        "global_id":"doi:10.12345/ABC/ZYXWVUT",
        "description":"test creating dataset",
        "published_at":"2022-09-01T16:05:03Z",
        "publisher":"ABC",
        "citationHtml":"DOE, JANE, 2022, \"test dataset #2\", <a href=\"https://doi.org/10.12345/ABC/ZYXWVUT\" target=\"_blank\">https://doi.org/10.12345/ABC/ZYXWVUT</a>, Demo Dataverse, V1",
        "identifier_of_dataverse":"ABC",
        "name_of_dataverse":"ABC",
        "citation":"DOE, JANE, 2022, \"test dataset #2\", https://doi.org/10.12345/ABC/ZYXWVUT, Demo Dataverse, V1",
        "storageIdentifier":"s3://10.12345/ABC/ZYXWVUT",
        "subjects":["Arts and Humanities"],
        "fileCount":0,
        "versionId":220004,
        "versionState":"RELEASED",
        "majorVersion":1,
        "minorVersion":0,
        "createdAt":"2022-09-01T16:04:43Z",
        "updatedAt":"2022-09-01T16:05:03Z",
        "contacts":[{
          "name":"DOE, JANE",
          "affiliation":"University of California, Los Angeles"
        }],
        "publications":[{}],
        "authors":["DOE, JANE"]
      }
    ]
  }
}
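
A rough Python sketch of what the installation table and the name matching could look like. Everything here is illustrative: `INSTALLATIONS` stands in for the table we would store, the Harvard base URL is an assumption (only the demo server appears in the queries above), and `normalize_name` is a deliberately naive helper that would need much more work for real matching.

```python
from urllib.parse import urlencode

# Hypothetical stand-in for the table of Dataverse installations.
# Only the demo server comes from this thread; the Harvard URL is an assumption.
INSTALLATIONS = [
    {"name": "Demo Dataverse", "base_url": "https://demo.dataverse.org"},
    {"name": "Harvard Dataverse", "base_url": "https://dataverse.harvard.edu"},
]


def search_url(base_url: str, query: str, item_type: str = "dataset") -> str:
    """Build a Search API URL for one installation."""
    params = urlencode({"q": query, "type": item_type})
    return f"{base_url.rstrip('/')}/api/search?{params}"


def normalize_name(name: str) -> str:
    """Lowercase a name and turn 'Last, First' into 'first last'."""
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    return " ".join(name.lower().split())


for installation in INSTALLATIONS:
    print(search_url(installation["base_url"], "trees"))

# The sample result above lists the author as "DOE, JANE"
print(normalize_name("DOE, JANE") == normalize_name("Jane Doe"))
```

Exact comparison after normalization will still miss initials, middle names, and transliterations, so this is only a starting point for the messy text matching.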
briri commented 1 year ago

Zenodo allows searching for 'published' records.

They have some other APIs currently in beta that allow searching for funders, grants, communities (e.g. 'dryad'), and licenses.

Their funder list uses Crossref funder DOIs currently.

The grants API returns info about a grant, but I'm not seeing any connection to the awardee.

curl "https://zenodo.org/api/grants/"

{
  "created":"2023-04-12T13:47:31.657945+00:00",
  "id":"10.13039/501100000780::101103476",
  "links":{"self":"https://zenodo.org/api/grants/10.13039/501100000780::101103476"},
  "metadata":{
    "$schema":"http://zenodo.org/schemas/grants/grant-v1.0.0.json",
    "acronym":"ERA TALENT",
    "code":"101103476",
    "enddate":"2026-02-28",
    "funder":{
      "$schema":"http://zenodo.org/schemas/funders/funder-v1.0.0.json",
      "acronyms":[],
      "country":"",
      "doi":"10.13039/501100000780",
      "identifiers":{"oaf":"ec__________::EC"},
      "name":"European Commission",
      "parent":{},
      "remote_created":"2011-06-08T16:00:03.000000",
      "remote_modified":"2019-07-19T16:49:12.000000",
      "subtype":"national government",
      "type":"gov"
    },
    "identifiers":{
      "eurepo":"info:eu-repo/grantAgreement/EC/HE/101103476/",
      "oaf":"corda_____he::becfdc0f5223e4577c583857048ffcf2",
      "purl":null
    },
    "internal_id":"10.13039/501100000780::101103476",
    "legacy_id":"10.13039/501100000780::101103476",
    "program":"HE",
    "remote_modified":"2021-04-27",
    "startdate":"2023-03-01",
    "suggest":{
      "contexts":{
        "funder":["10.13039/501100000780"]
      },
      "input":["101103476","ERA TALENT","ERA TALENT Platform for career development of researchers in Europe"]
    },
    "title":"ERA TALENT Platform for career development of researchers in Europe",
    "url":""
  },
  "updated":"2023-04-12T13:47:31.657960+00:00"
}

The records (aka datasets) are pretty good (at least the ones provided by Dryad for NIH). The grant/award id is buried in the 'notes' field, but may be searchable/filterable.

Records allow for ORCID 🎉 but I am not seeing RORs.

Here is an example:

# Records that were funded by NIH
curl "https://zenodo.org/api/records/?q=grants.funder.doi:doi.org%2F10.13039%2F100000002"

{
  "conceptrecid":"1234567",
  "created":"2000-01-01T13:14:15.077983+00:00",
  "doi":"10.12345/ABC123",
  "files":[{
    "bucket":"0000000000000000000000",
    "checksum":"md5:abcdefghijklmnop",
    "key":"Biological_data.fcs",
    "links":{"self":"https://zenodo.org/api/files/00000000000000/Biological_data.fcs"},
    "size":22599873,
    "type":"fcs"
  }],
  "id":11111111,
  "links":{
    "badge":"https://zenodo.org/badge/doi/10.12345/ABC123.svg",
    "bucket":"https://zenodo.org/api/files/000000000000000000000",
    "doi":"https://doi.org/10.12345/ABC123",
    "html":"https://zenodo.org/record/1234567",
    "latest":"https://zenodo.org/api/records/1234567",
    "latest_html":"https://zenodo.org/record/1234567",
    "self":"https://zenodo.org/api/records/1234567"
  },
  "metadata":{
    "access_right":"open",
    "access_right_category":"success",
    "communities":[{"id":"dryad"}],
    "creators":[{
      "affiliation":"Example University",
      "name":"Doe, Jane",
      "orcid":"0000-0000-0000-0000"
    }],
    "description":"<p>Research data about biological stuff</p>",
    "doi":"12345/ABC123",
    "keywords":["human immunodeficiency virus (HIV)","cell death","apoptosis","pyroptosis","lymphoid tissues"],
    "license":{"id":"CC0-1.0"},
    "method":"<p>mass cytometry; single-cell RNA-seq</p>\n<p>mass cytometry data has been pre-gated on live singlets</p>",
    "notes":"<p>Funding provided by: National Institutes of Health<br>Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000002<br>Award Number: A12 0A123456</p>",
    "publication_date":"2000-01-01",
    "related_identifiers":[{
      "identifier":"10.98765/journal.2000.99999","relation":"isCitedBy","scheme":"doi"
    }],
    "relations":{
      "version":[{
        "count":1,
        "index":0,
        "is_last":true,
        "last_child":{"pid_type":"recid","pid_value":"8888888"},
        "parent":{"pid_type":"recid","pid_value":"7777777"}
      }]
    },
    "resource_type":{
      "title":"Dataset",
      "type":"dataset"
    },
    "title":"Data from: my research about biological stuff."
  },
  "owners":[00000],
  "revision":2,
  "stats":{
    "downloads":3.0,
    "unique_downloads":3.0,
    "unique_views":7.0,
    "version_downloads":3.0,
    "version_unique_downloads":3.0,
    "version_unique_views":7.0,
    "version_views":7.0,
    "version_volume":581366663.0,
    "views":7.0,
    "volume":581366663.0
  },
  "updated":"2000-01-01T14:15:16.325605+00:00"
}
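
If the award id cannot be filtered on server-side, we could pull it out of the 'notes' HTML ourselves. Here is a minimal Python sketch, assuming the notes always follow the "Label: value" lines seen in the Dryad-provided sample above. Other depositors may format this field differently, and `parse_funding_notes` is a hypothetical helper name.

```python
import re


def parse_funding_notes(notes: str) -> dict:
    """Pull 'Label: value' pairs out of the HTML in a record's `notes` field."""
    # Replace every tag with a newline so each labelled value sits on its own line.
    text = re.sub(r"<[^>]+>", "\n", notes)
    fields = {}
    for line in text.splitlines():
        label, separator, value = line.partition(": ")
        if separator and value.strip():
            fields[label.strip()] = value.strip()
    return fields


# The `notes` value from the sample record above
sample_notes = (
    "<p>Funding provided by: National Institutes of Health<br>"
    "Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000002<br>"
    "Award Number: A12 0A123456</p>"
)

funding = parse_funding_notes(sample_notes)
print(funding.get("Award Number"))
```

Splitting on the first ": " keeps the funder URL intact, because the colon in "http://" is not followed by a space.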
briri commented 1 year ago

Dryad allows searching for 'published' datasets and allows you to filter the results by author affiliation.

Dryad of course contains RORs and ORCIDs 🎉

There does not seem to be a way to search by ORCID.

For example:

curl -X 'GET' \
  'https://datadryad.org/api/v2/search?q=molecular&affiliation=https%3A%2F%2Fror.org%2F01an7q238' \
  -H 'accept: application/json'

# Result
# ----------------------
{
  "_links": {
    "self": {
      "href": "/api/v2/datasets/doi%3A10.12345%2Fdryad.abc12"
    },
    "stash:versions": {
      "href": "/api/v2/datasets/doi%3A10.12345%2Fdryad.abc12/versions"
    },
    "stash:version": {
      "href": "/api/v2/versions/1234"
    },
    "stash:download": {
      "href": "/api/v2/datasets/doi%3A10.12345%2Fdryad.abc12/download"
    },
    "curies": [
      {
        "name": "stash",
        "href": "https://github.com/CDL-Dryad/stash/blob/main/stash_api/link-relations.md#{rel}",
        "templated": "true"
      }
    ]
  },
  "identifier": "doi:10.12345%2Fdryad.abc12",
  "id": 12345,
  "storageSize": 1032385573,
  "relatedPublicationISSN": "1234-123X",
  "title": "Data from: Measuring ectoplasm amounts left by Slimer",
  "authors": [
    {
      "firstName": "Jane C.",
      "lastName": "Doe",
      "affiliation": "University of Minnesota",
      "affiliationROR": "https://ror.org/017zqws13"
    },
    {
      "firstName": "John Jacob",
      "lastName": "Jingle Hymer-Smith",
      "email": "john@example.com",
      "affiliation": "University of California, Berkeley",
      "affiliationROR": "https://ror.org/01an7q238",
      "orcid": "0000-0000-0000-0000"
    }
  ],
  "abstract": "Dispersal plays a prominent role in how gross it feels to be slimed.",
  "keywords": [
    "dispersal limitation",
    "Metacommunities"
  ],
  "usageNotes": "Use with caution!",
  "relatedWorks": [
    {
      "relationship": "primary_article",
      "identifierType": "DOI",
      "identifier": "https://doi.org/10.1234/j.4567zyx.2000.9876.a"
    }
  ],
  "versionNumber": 1,
  "versionStatus": "submitted",
  "curationStatus": "Published",
  "versionChanges": "none",
  "publicationDate": "2000-01-01",
  "lastModificationDate": "2000-01-01",
  "visibility": "public",
  "sharingLink": "https://datadryad.org/stash/share/0000000000aaaaaaaaaaaaa",
  "userId": 12345,
  "license": "https://creativecommons.org/publicdomain/zero/1.0/"
}
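
Since there is no ORCID search, one workaround might be to search by affiliation ROR (as in the curl call above) and then filter the returned datasets by author ORCID on our side. Here is a small Python sketch of both steps. The helper names are made up, and the dataset shape is assumed to match the sample result above.

```python
from urllib.parse import urlencode

DRYAD_SEARCH = "https://datadryad.org/api/v2/search"


def dryad_search_url(query: str, affiliation_ror: str) -> str:
    """Build the search URL, percent-encoding the ROR like the curl example."""
    params = urlencode({"q": query, "affiliation": affiliation_ror})
    return f"{DRYAD_SEARCH}?{params}"


def has_author_orcid(dataset: dict, orcid: str) -> bool:
    """True if any author on the dataset carries the given ORCID."""
    return any(a.get("orcid") == orcid for a in dataset.get("authors", []))


# Trimmed copy of the sample result above
sample_dataset = {
    "identifier": "doi:10.12345%2Fdryad.abc12",
    "authors": [
        {
            "firstName": "Jane C.",
            "lastName": "Doe",
            "affiliationROR": "https://ror.org/017zqws13",
        },
        {
            "firstName": "John Jacob",
            "lastName": "Jingle Hymer-Smith",
            "affiliationROR": "https://ror.org/01an7q238",
            "orcid": "0000-0000-0000-0000",
        },
    ],
}

print(dryad_search_url("molecular", "https://ror.org/01an7q238"))
print(has_author_orcid(sample_dataset, "0000-0000-0000-0000"))
```

This only finds a researcher's datasets when we already know (or can guess) their affiliation, so it narrows the gap rather than closing it.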
briri commented 1 year ago

Review complete. Leaving this one open, though, so we can reference it when we decide to implement integrations for these APIs.

pdurbin commented 1 year ago

"I am not seeing ORCID or ROR identifiers in the output"

@briri hi, thanks for kicking the tires on the Dataverse Search API! 🎉

It's not very intuitive but you can get ORCIDs out of the Search API. If you pass metadata_fields=citation:author, for example (docs), you can get more details about that field (author). Below is an example where you can see an ORCID. We need to make this easier, obviously. 😅

We have ROR support for our author affiliation field but haven't rolled it out to our demo server yet. You can track this here:

For a list of searchable fields, this might help: https://demo.dataverse.org/api/metadatablocks/citation

Please feel free to ask questions at https://chat.dataverse.org or https://groups.google.com/g/dataverse-community

curl 'https://demo.dataverse.org/api/search?q=F8QXRU&metadata_fields=citation:author'

{
  "status": "OK",
  "data": {
    "q": "F8QXRU",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "The History of Coffee",
        "type": "dataset",
        "url": "https://doi.org/10.70122/FK2/F8QXRU",
        "global_id": "doi:10.70122/FK2/F8QXRU",
        "description": "Description text",
        "published_at": "2023-01-12T19:31:16Z",
        "publisher": "Dataverse de Exemplo Lepidus",
        "citationHtml": "admin, admin; Castanheiras, Iris, 2023, \"The History of Coffee\", <a href=\"https://doi.org/10.70122/FK2/F8QXRU\" target=\"_blank\">https://doi.org/10.70122/FK2/F8QXRU</a>, Demo Dataverse, V1, UNF:6:dEgtc5Z1MSF3u7c+kF4kXg== [fileUNF]",
        "identifier_of_dataverse": "dataverseDeExemplo",
        "name_of_dataverse": "Dataverse de Exemplo Lepidus",
        "citation": "admin, admin; Castanheiras, Iris, 2023, \"The History of Coffee\", https://doi.org/10.70122/FK2/F8QXRU, Demo Dataverse, V1, UNF:6:dEgtc5Z1MSF3u7c+kF4kXg== [fileUNF]",
        "storageIdentifier": "s3://10.70122/FK2/F8QXRU",
        "keywords": [
          "Documentary"
        ],
        "subjects": [
          "Agricultural Sciences"
        ],
        "fileCount": 1,
        "versionId": 224817,
        "versionState": "RELEASED",
        "majorVersion": 1,
        "minorVersion": 0,
        "createdAt": "2023-01-12T14:26:17Z",
        "updatedAt": "2023-01-12T19:31:16Z",
        "contacts": [
          {
            "name": "Conta de Desenvolvimento para Testes",
            "affiliation": ""
          }
        ],
        "publications": [
          {
            "citation": "admin, a., &amp; Castanheiras, I. (2023). <em>The History of Coffee</em>. Lepidus"
          }
        ],
        "metadataBlocks": {
          "citation": {
            "displayName": "Citation Metadata",
            "fields": [
              {
                "typeName": "author",
                "multiple": true,
                "typeClass": "compound",
                "value": [
                  {
                    "authorName": {
                      "typeName": "authorName",
                      "multiple": false,
                      "typeClass": "primitive",
                      "value": "admin, admin"
                    }
                  },
                  {
                    "authorName": {
                      "typeName": "authorName",
                      "multiple": false,
                      "typeClass": "primitive",
                      "value": "Castanheiras, Iris"
                    },
                    "authorAffiliation": {
                      "typeName": "authorAffiliation",
                      "multiple": false,
                      "typeClass": "primitive",
                      "value": "Lepidus"
                    },
                    "authorIdentifierScheme": {
                      "typeName": "authorIdentifierScheme",
                      "multiple": false,
                      "typeClass": "controlledVocabulary",
                      "value": "ORCID"
                    },
                    "authorIdentifier": {
                      "typeName": "authorIdentifier",
                      "multiple": false,
                      "typeClass": "primitive",
                      "value": "0000-0002-1825-0097"
                    }
                  }
                ]
              }
            ]
          }
        },
        "authors": [
          "admin, admin",
          "Castanheiras, Iris"
        ]
      }
    ],
    "count_in_response": 1
  }
}
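
For anyone consuming this from code, here is a rough Python sketch of digging the ORCIDs out of the nested `metadataBlocks` structure. It assumes the response shape shown in the example above, and `author_orcids` is just an illustrative helper name, not something Dataverse provides.

```python
def author_orcids(item: dict) -> dict:
    """Map author name to ORCID for one entry of the Search API `items` array."""
    orcids = {}
    fields = item.get("metadataBlocks", {}).get("citation", {}).get("fields", [])
    for field in fields:
        if field.get("typeName") != "author":
            continue
        for author in field.get("value", []):
            scheme = author.get("authorIdentifierScheme", {}).get("value")
            identifier = author.get("authorIdentifier", {}).get("value")
            name = author.get("authorName", {}).get("value")
            # Authors without an identifier, or with a non-ORCID scheme, are skipped.
            if scheme == "ORCID" and identifier and name:
                orcids[name] = identifier
    return orcids


# Trimmed copy of the item in the example above
sample_item = {
    "name": "The History of Coffee",
    "metadataBlocks": {
        "citation": {
            "fields": [
                {
                    "typeName": "author",
                    "value": [
                        {"authorName": {"value": "admin, admin"}},
                        {
                            "authorName": {"value": "Castanheiras, Iris"},
                            "authorAffiliation": {"value": "Lepidus"},
                            "authorIdentifierScheme": {"value": "ORCID"},
                            "authorIdentifier": {"value": "0000-0002-1825-0097"},
                        },
                    ],
                }
            ]
        }
    },
}

print(author_orcids(sample_item))
```

Items returned without `metadata_fields=citation:author` have no `metadataBlocks` key, in which case the function simply returns an empty dict.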
briri commented 1 year ago

thanks @pdurbin this is very helpful!

briri commented 3 months ago

somewhat related to #77

briri commented 3 months ago

closing as our investigation is done. will create new tickets if we decide to build integrations/harvesters