datacite / bolognese

Ruby gem and command-line utility for conversion of DOI metadata
MIT License
40 stars 14 forks source link

Fixes normalization behavior that omitted non-URL funding identifiers when reading from DataCite XML. Adds normalization for Crossref Funder ID and ROR funderIdentifierTypes. #143

Closed codycooperross closed 1 year ago

codycooperross commented 1 year ago

Purpose

When reading DataCite XML, bolognese would not return the funder identifier in a funding reference if the funder identifier did not originally appear as a URL.

closes: #142

Approach

When processing funding_references, a normalization step would attempt to render the funder_identifier as a URL:

https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/lib/bolognese/readers/datacite_reader.rb#L161

If the schemeURI was present, normalize_id would be called on a concatenation of the schemeURI and funder identifier. Otherwise, normalize_id would be called on the funder identifier. If normalize_id could not render either as a URI, funder_identifier would become nil:

https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/lib/bolognese/utils.rb#L614

This PR normalizes Crossref Funder IDs and RORs but performs no normalization steps otherwise, avoiding a funder_identifier nil value. It also adds the optional schemeUri attribute to the rendered funding reference, which was originally not included.

Open Questions and Pre-Merge TODOs

Types of changes

Reviewer, please remember our guidelines:

digitaldogsbody commented 1 year ago

This fix looks good :)

We might want to consider still calling the normalize_id function for types other than Crossref Funder and ROR, as it also cleans up URLs when present (removes UTM fragments etc, see https://github.com/postrank-labs/postrank-uri#example), which might be a useful thing to keep.

In that case, my proposition would be to restore the

if funder_identifier_type != "Other"
            funder_identifier = !funder_identifier.to_s.start_with?("https://","http://") && scheme_uri.present? ? normalize_id(scheme_uri + funder_identifier) : normalize_id(funder_identifier)

lines and make a modification to normalize_id here: https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/lib/bolognese/utils.rb#L623 to return id instead of nil, which essentially means that URLs still get cleaned up, but any non-URL value passed will just be returned as-is when it fails the if statement on that line.

codycooperross commented 1 year ago

Looking at normalize_id elsewhere, it seems like the function is used to normalize/validate URLs specifically:

https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/lib/bolognese/readers/schema_org_reader.rb#L29

Also, some of the tests expect nil values:

https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/spec/utils_spec.rb#L175

Maybe a separate function would be best? Although the naming of normalize_id certainly confuses things.

digitaldogsbody commented 1 year ago

Looking at normalize_id elsewhere, it seems like the function is used to normalize/validate URLs specifically:

https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/lib/bolognese/readers/schema_org_reader.rb#L29

How frustrating when we have the normalize_url function right there!

Also, some of the tests expect nil values:

https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/spec/utils_spec.rb#L175

Maybe a separate function would be best? Although the naming of normalize_id certainly confuses things.

Two parts to this, I think:

1) When the value is clearly expected to always be a URL (e.g the schema.org processing, where the next thing that happens with the return from the function is an HTTP request with Maremma), the code should change to use normalize_url instead of normalize_id

2) For things such as that test, where the ID may or may not be a URL, and should return nil if the ID is a URL but not HTTP or HTTPS, then a different change to normalize_id, where this code: https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/lib/bolognese/utils.rb#L621-L623

is split out into a slightly different logic flow:

codycooperross commented 1 year ago

When attempting to implement this logic flow:

  • Attempt to parse URL

    • If valid URL, check the scheme

    • If not http or https, return nil

    • If http or https, proceed to URL cleanup

    • If not valid URL, return as is

. . . it seems the following code gets flummoxed since it depends on input being nil if input is not a valid URL:

https://github.com/datacite/bolognese/blob/e16d372a8e72f26f9aa18d835464eb7d6124a9f8/lib/bolognese/metadata.rb#L20

This results in a few tests failing, like this one:

https://github.com/datacite/bolognese/blob/e16d372a8e72f26f9aa18d835464eb7d6124a9f8/spec/readers/ris_reader_spec.rb#L16

Rather than refactoring metadata, I opted for a less elegant normalization step in datacite_reader. It doesn't accomplish

  • If not http or https, return nil

. . . but it does normalize http and https URLs. @digitaldogsbody , do you think more work needs to be done here?

digitaldogsbody commented 1 year ago

. @digitaldogsbody , do you think more work needs to be done here?

I think this is a good solution for now. My only outstanding question would be about no longer passing scheme_uri when it is present, as it did with the original behaviour:

if funder_identifier_type != "Other"
            funder_identifier = !funder_identifier.to_s.start_with?("https://","http://") && scheme_uri.present? ? normalize_id(scheme_uri + funder_identifier) : normalize_id(funder_identifier)

You might consider adding that logic back into the code in your else block here: https://github.com/datacite/bolognese/pull/143/files#diff-4cfebbeb62130d0a784d3dcc28af41aed547f5bd4da1ec9e4600419e0c4638b0R167, just in case it is depended on somewhere that isn't well tested.

codycooperross commented 1 year ago

My only outstanding question would be about no longer passing scheme_uri when it is present, as it did with the original behaviour:

if funder_identifier_type != "Other"
            funder_identifier = !funder_identifier.to_s.start_with?("https://","http://") && scheme_uri.present? ? normalize_id(scheme_uri + funder_identifier) : normalize_id(funder_identifier)

This scheme_uri behavior is considered incorrect or a bug at this point. We're also removing it from name_identifier here: https://github.com/datacite/bolognese/pull/137