datacite / bolognese

Ruby gem and command-line utility for conversion of DOI metadata
MIT License

Affiliation identifiers are not normalized consistently when reading from DataCite XML #152

Closed codycooperross closed 1 year ago

codycooperross commented 1 year ago

Describe the bug

When reading from DataCite XML, affiliation identifiers are normalized as either 1) a concatenation of the schemeURI and affiliation identifier or 2) as a URL, if the affiliation identifier starts with "https:/". Identifiers not rendered as URLs are discarded.

Expected Behaviour

Affiliation identifiers should be normalized according to their identifier scheme. Other fields, such as funder identifiers and name identifiers, are already normalized according to their identifier scheme, so their normalization works in more scenarios for identifiers like RORs and ORCIDs.

Current Behaviour

Identifiers are not normalized according to their identifierScheme, and identifiers not rendered as URLs are discarded.
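
To make the failure mode concrete, here is a minimal, self-contained model of the behaviour described above. It is a simplified sketch, not the actual bolognese code: the method name `current_affiliation_identifier` and the URL check are assumptions made for illustration only.

```ruby
require "uri"

# Simplified model of the behaviour described in this issue.
# NOT the bolognese implementation; the method name is hypothetical.
def current_affiliation_identifier(identifier, scheme_uri = nil)
  return nil if identifier.nil? || identifier.strip.empty?

  id = identifier.strip
  candidate =
    if !id.start_with?("https://") && scheme_uri && !scheme_uri.empty?
      scheme_uri + id # case 1: concatenate schemeURI and identifier
    else
      id              # case 2: use the identifier as given
    end

  # Only values that parse as http(s) URLs survive; anything else is dropped.
  uri = URI.parse(candidate)
  uri.is_a?(URI::HTTP) && uri.host ? candidate : nil
rescue URI::InvalidURIError
  nil
end

p current_affiliation_identifier("05dxps055")                     # => nil
p current_affiliation_identifier("05dxps055", "https://ror.org/") # => "https://ror.org/05dxps055"
p current_affiliation_identifier("https://ror.org/05dxps055")     # => "https://ror.org/05dxps055"
```

In this model the affiliationIdentifierScheme is never consulted, which is why a bare ROR ID with no schemeURI falls through and is lost.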

Steps to Reproduce

ROR affiliation identifiers supplied without the "https://ror.org" URL prefix and without a schemeURI (e.g. 05dxps055) are discarded when read by bolognese.

<affiliation affiliationIdentifier="05dxps055" affiliationIdentifierScheme="ROR">California Institute of Technology</affiliation>

"affiliation": [
  {
    "name": "California Institute of Technology",
    "affiliationIdentifierScheme": "ROR"
  }
],

Context (Environment)

This issue affects DataCite JSON and API responses when metadata is submitted as XML.

Proposal

Possible Implementation

Affiliation identifier normalization could mirror name identifier normalization here:

https://github.com/datacite/bolognese/blob/b0a7df3c9dd6a45eaf56fd0e06d304e4db9b837d/lib/bolognese/author_utils.rb#L32
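
As a rough sketch of what scheme-based normalization could look like, the snippet below dispatches on the affiliationIdentifierScheme and never discards a non-empty identifier. All names here (`ROR_PATTERN`, `normalize_ror_sketch`, `normalize_affiliation_identifier`) are hypothetical and written for this example; they are not bolognese's API, and the ROR pattern is a loose assumption about the ID shape (a leading 0, six alphanumeric characters, two check digits).

```ruby
# Hypothetical helpers for illustration; not part of bolognese.
# Loose assumption about ROR IDs: "0" + 6 alphanumerics + 2 digits,
# optionally prefixed with ror.org/ or http(s)://ror.org/.
ROR_PATTERN = %r{\A(?:(?:https?://)?ror\.org/)?(0[0-9a-z]{6}\d{2})\z}i

def normalize_ror_sketch(id)
  match = ROR_PATTERN.match(id.to_s.strip)
  match && "https://ror.org/#{match[1].downcase}"
end

def normalize_affiliation_identifier(identifier, scheme: nil, scheme_uri: nil)
  id = identifier.to_s.strip
  return nil if id.empty?

  case scheme.to_s.upcase
  when "ROR"
    # Fall back to the raw value rather than dropping it.
    normalize_ror_sketch(id) || id
  else
    if id.start_with?("http://", "https://")
      id
    elsif scheme_uri && !scheme_uri.to_s.empty?
      "#{scheme_uri.to_s.chomp('/')}/#{id}"
    else
      id # keep the identifier even when it cannot be rendered as a URL
    end
  end
end

p normalize_affiliation_identifier("05dxps055", scheme: "ROR")
# => "https://ror.org/05dxps055"
p normalize_affiliation_identifier("grid.20861.3d", scheme: "GRID")
# => "grid.20861.3d"
```

The key design choice is the fallback: when a value cannot be turned into a URL, it is returned as-is instead of being dropped.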


richardhallett commented 1 year ago

The goal here should be to never discard an identifier, whether or not it is a URL. The proposal to follow how name_identifiers works makes sense to me.

ashwinisukale commented 1 year ago

I was able to reproduce this bug locally with the following steps.

This works

1) In this fixture file, remove the schemeURI attribute and keep the affiliationIdentifier as a URL: https://github.com/datacite/bolognese/blob/b0a7df3c9dd6a45eaf56fd0e06d304e4db9b837d/spec/fixtures/datacite-example-ROR-nameIdentifiers.xml#L9

After this XML file is processed at bolognese/spec/author_utils_spec.rb:163, creators with an affiliation have an affiliationIdentifier in the response.

(byebug) subject.creators[0]
{"nameType"=>"Personal", "name"=>"Robinson, Erin", "givenName"=>"Erin", "familyName"=>"Robinson", "nameIdentifiers"=>[{"schemeUri"=>"https://orcid.org", "nameIdentifierScheme"=>"ORCID"}], "affiliation"=>[{"name"=>"Metadata Game Changers", "affiliationIdentifier"=>"https://ror.org/05bp8ka05", "affiliationIdentifierScheme"=>"ROR"}]}

This won't work

2) In this fixture file, remove the schemeURI attribute and keep the affiliationIdentifier without a URL, as below:

<creator>
    <creatorName nameType="Personal">Erin Robinson</creatorName>
    <nameIdentifier schemeURI="https://orcid.org/" nameIdentifierScheme="ORCID"> https://orcid.org/0000-0001-9998-0114 </nameIdentifier>
    <affiliation affiliationIdentifier="05bp8ka05" affiliationIdentifierScheme="ROR"> Metadata Game Changers </affiliation>
</creator>

Now add byebug in the test file at bolognese/spec/author_utils_spec.rb:163 and check subject after the metadata is processed. The affiliation attribute in the response does not have an affiliationIdentifier.

(byebug) subject.creators[0]
{"nameType"=>"Personal", "name"=>"Robinson, Erin", "givenName"=>"Erin", "familyName"=>"Robinson", "nameIdentifiers"=>[{"schemeUri"=>"https://orcid.org", "nameIdentifierScheme"=>"ORCID"}], "affiliation"=>[{"name"=>"Metadata Game Changers", "affiliationIdentifierScheme"=>"ROR"}]}