CDLUC3 / dmsp_aws_prototype

Sceptre CloudFormation templates for DMPHub v2
MIT License

Setup the DataCite harvester #104

Closed: briri closed 5 months ago

briri commented 8 months ago

Need to update the old DataCite harvester code so that it:

briri commented 7 months ago

We should make a pass to look up the DMP ID itself in DataCite and check whether it has relatedIdentifiers. There is no need for an admin to verify those connections.
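A rough sketch of that check, assuming the DMP ID is a DOI registered with DataCite and that the record is read from the public REST endpoint `https://api.datacite.org/dois/{doi}`. The function names `related_identifiers` and `fetch_dmp_record` are made up for illustration and are not part of the harvester.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumption: the public DataCite REST API is used for the lookup.
DATACITE_API = "https://api.datacite.org/dois/"


def related_identifiers(record: dict) -> list[tuple[str, str]]:
    """Return (relationType, relatedIdentifier) pairs from a DOI record."""
    attributes = (record.get("data") or {}).get("attributes") or {}
    pairs = []
    # relatedIdentifiers may be missing or null when nothing has been linked
    for item in attributes.get("relatedIdentifiers") or []:
        identifier = item.get("relatedIdentifier")
        if identifier:
            pairs.append((item.get("relationType", ""), identifier))
    return pairs


def fetch_dmp_record(dmp_doi: str) -> dict:
    """Fetch the DOI record for a DMP ID (network call)."""
    url = DATACITE_API + quote(dmp_doi, safe="/")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

Anything returned by `related_identifiers(fetch_dmp_record(dmp_id))` was asserted in the DMP's own DOI metadata, which is why it could be attached without a manual review step.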

briri commented 6 months ago

Sent an email to DataCite support: their pagination cursor is not working. The API is able to provide a start and end cursor as well as the total count of works, but it errors when requesting the current cursor. For now I have updated the code to pull only the first 500 works.

query affiliationQuery {
  organization(id: "https://ror.org/01an7q238") {
    id
    name
    alternateName
    works(query: "created: [2023-10-01 TO 2023-11-01]", first: 635, after: "MTY5NjE5NzQ0NDAwMCwxMC43OTIyL2cyMW43emdw") {
      totalCount
      pageInfo {
        startCursor
        endCursor
        hasNextPage
      }
      edges {
        cursor
      }
      nodes {
        id
        doi
        type
      }
    }
  }
}
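
For reference, a minimal sketch of the cursor loop this query is meant to drive, written so that a failing cursor request degrades to "first page only" as described above. `fetch_page` is a hypothetical callable (not existing harvester code) that runs the query with the given `after` value and returns the parsed `works` object. `max_pages` is an arbitrary safety cap.

```python
from typing import Callable, Iterator, Optional


def iter_works(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 50,
) -> Iterator[dict]:
    """Yield work nodes page by page, following pageInfo.endCursor."""
    after = None  # None requests the first page
    for _ in range(max_pages):
        try:
            works = fetch_page(after)
        except Exception:
            # The cursor request failed: stop and keep what was already yielded
            break
        yield from works.get("nodes", [])
        page_info = works.get("pageInfo", {})
        if not page_info.get("hasNextPage"):
            break
        after = page_info.get("endCursor")
```

With a working cursor this walks every page. With the current behaviour it yields the first page and then stops quietly.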

briri commented 6 months ago

Had to switch to the DataCite REST API. We are working with older data, so we do not have useful entry points into the GraphQL API (e.g. we have a ROR but the records we are after do not).

We had some success with the REST API: it found related works for about half of the uploaded DMPs. The solution is problematic, though, and we will need to keep working at it to find a balance.

What's wrong?

This all means that we are stuck with matching on a single PI name. It worked for this first round of testing but what about when we have a "Jane Smith"? We will end up with too many matches.
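
To make the concern concrete, here is roughly what a name-only lookup against the REST API looks like. This is a sketch under assumptions: it uses the `query` parameter of `https://api.datacite.org/dois` with a `creators.name` field search, and the helper name `pi_name_search_url` is invented for illustration. The actual harvester query may differ.

```python
from urllib.parse import urlencode

# Assumption: the public DataCite REST search endpoint
DATACITE_DOIS = "https://api.datacite.org/dois"


def pi_name_search_url(pi_name: str, page_size: int = 25) -> str:
    """Build a search URL that matches works on a single creator name."""
    # Escape backslashes and quotes so the name stays one quoted phrase
    escaped = pi_name.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "query": f'creators.name:"{escaped}"',
        "page[size]": page_size,
    }
    return f"{DATACITE_DOIS}?{urlencode(params)}"
```

Nothing in the URL from `pi_name_search_url("Jane Smith")` distinguishes one Jane Smith from another, so every work by any creator with that name is a candidate match.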