Fixes normalization behavior that omitted non-URL funding identifiers when reading from DataCite XML. Adds normalization for Crossref Funder ID and ROR funderIdentifierTypes.

codycooperross commented 1 year ago

Purpose

When reading DataCite XML, bolognese would not return the funder identifier in a funding reference if the funder identifier did not originally appear as a URL.

closes: #142

Approach

When processing funding_references, a normalization step would attempt to render the funder_identifier as a URL:

https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/lib/bolognese/readers/datacite_reader.rb#L161

If the schemeURI was present, normalize_id would be called on a concatenation of the schemeURI and funder identifier. Otherwise, normalize_id would be called on the funder identifier. If normalize_id could not render either as a URI, funder_identifier would become nil:

https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/lib/bolognese/utils.rb#L614

This PR normalizes Crossref Funder IDs and RORs but performs no normalization steps otherwise, avoiding a funder_identifier nil value. It also adds the optional schemeUri attribute to the rendered funding reference, which was originally not included.

Open Questions and Pre-Merge TODOs

[ ] Does including schemeUri here have any downstream issues?
[ ] What changes in lupo are necessary to properly import and index fundingReference schemeUris?

Types of changes

[x] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)

Reviewer, please remember our guidelines:

Be humble in the language and feedback you give, ask don't tell.
Consider using positive language as opposed to neutral when offering feedback. This is to avoid the negative bias that can occur with neutral language appearing negative.
Offer suggestions on how to improve code e.g. simplification or expanding clarity.
Ensure you give reasons for the changes you are proposing.

digitaldogsbody commented 1 year ago

This fix looks good :)

We might want to consider still calling the normalize_id function for types other than Crossref Funder and ROR, as it also cleans up URLs when present (removes UTM fragments etc, see https://github.com/postrank-labs/postrank-uri#example), which might be a useful thing to keep.

In that case, my proposition would be to restore the

if funder_identifier_type != "Other"
            funder_identifier = !funder_identifier.to_s.start_with?("https://","http://") && scheme_uri.present? ? normalize_id(scheme_uri + funder_identifier) : normalize_id(funder_identifier)

lines and make a modification to normalize_id here: https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/lib/bolognese/utils.rb#L623 to return id instead of nil, which essentially means that URLs still get cleaned up, but any non-URL value passed will just be returned as-is when it fails the if statement on that line.

codycooperross commented 1 year ago

Looking at normalize_id elsewhere, it seems like the function is used to normalize/validate URLs specifically:

https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/lib/bolognese/readers/schema_org_reader.rb#L29

Also, some of the tests expect nil values:

https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/spec/utils_spec.rb#L175

Maybe a separate function would be best? Although the naming of normalize_id certainly confuses things.

digitaldogsbody commented 1 year ago

Looking at normalize_id elsewhere, it seems like the function is used to normalize/validate URLs specifically:

https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/lib/bolognese/readers/schema_org_reader.rb#L29

How frustrating when we have the normalize_url function right there!

Also, some of the tests expect nil values:

https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/spec/utils_spec.rb#L175

Maybe a separate function would be best? Although the naming of normalize_id certainly confuses things.

Two parts to this, I think:

1) When the value is clearly expected to always be a URL (e.g the schema.org processing, where the next thing that happens with the return from the function is an HTTP request with Maremma), the code should change to use normalize_url instead of normalize_id

2) For things such as that test, where the ID may or may not be a URL, and should return nil if the ID is a URL but not HTTP or HTTPS, then a different change to normalize_id, where this code: https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/lib/bolognese/utils.rb#L621-L623

is split out into a slightly different logic flow:

Attempt to parse URL
- If valid URL, check the scheme
- If not http or https, return nil
- If http or https, proceed to URL cleanup
- If not valid URL, return as is

codycooperross commented 1 year ago

When attempting to implement this logic flow:

Attempt to parse URL

If valid URL, check the scheme

If not http or https, return nil

If http or https, proceed to URL cleanup

If not valid URL, return as is

. . . it seems the following code gets flummoxed since it depends on input being nil if input is not a valid URL:

https://github.com/datacite/bolognese/blob/e16d372a8e72f26f9aa18d835464eb7d6124a9f8/lib/bolognese/metadata.rb#L20

This results in a few tests failing, like this one:

https://github.com/datacite/bolognese/blob/e16d372a8e72f26f9aa18d835464eb7d6124a9f8/spec/readers/ris_reader_spec.rb#L16

Rather than refactoring metadata, I opted for a less elegant normalization step in datacite_reader. It doesn't accomplish

If not http or https, return nil

. . . but it does normalize http and https URLs. @digitaldogsbody , do you think more work needs to be done here?

digitaldogsbody commented 1 year ago

. @digitaldogsbody , do you think more work needs to be done here?

I think this is a good solution for now. My only outstanding question would be about no longer passing scheme_uri when it is present, as it did with the original behaviour:

if funder_identifier_type != "Other"
            funder_identifier = !funder_identifier.to_s.start_with?("https://","http://") && scheme_uri.present? ? normalize_id(scheme_uri + funder_identifier) : normalize_id(funder_identifier)

You might consider adding that logic back into the code in your else block here: https://github.com/datacite/bolognese/pull/143/files#diff-4cfebbeb62130d0a784d3dcc28af41aed547f5bd4da1ec9e4600419e0c4638b0R167, just in case it is depended on somewhere that isn't well tested.

codycooperross commented 1 year ago

My only outstanding question would be about no longer passing scheme_uri when it is present, as it did with the original behaviour:
if funder_identifier_type != "Other"
            funder_identifier = !funder_identifier.to_s.start_with?("https://","http://") && scheme_uri.present? ? normalize_id(scheme_uri + funder_identifier) : normalize_id(funder_identifier)

This scheme_uri behavior is considered incorrect or a bug at this point. We're also removing it from name_identifier here: https://github.com/datacite/bolognese/pull/137

datacite / bolognese