Closed codycooperross closed 1 year ago
This fix looks good :)
We might want to consider still calling the normalize_id
function for types other than Crossref Funder and ROR, as it also cleans up URLs when present (removes UTM fragments etc, see https://github.com/postrank-labs/postrank-uri#example), which might be a useful thing to keep.
In that case, my proposition would be to restore the
if funder_identifier_type != "Other"
funder_identifier = !funder_identifier.to_s.start_with?("https://","http://") && scheme_uri.present? ? normalize_id(scheme_uri + funder_identifier) : normalize_id(funder_identifier)
lines and make a modification to normalize_id
here:
https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/lib/bolognese/utils.rb#L623
to return id
instead of nil, which essentially means that URLs still get cleaned up, but any non-URL value passed will just be returned as-is when it fails the if statement on that line.
Looking at normalize_id
elsewhere, it seems like the function is used to normalize/validate URLs specifically:
Also, some of the tests expect nil values:
Maybe a separate function would be best? Although the naming of normalize_id
certainly confuses things.
Looking at
normalize_id
elsewhere, it seems like the function is used to normalize/validate URLs specifically:
How frustrating when we have the normalize_url
function right there!
Also, some of the tests expect nil values:
Maybe a separate function would be best? Although the naming of
normalize_id
certainly confuses things.
Two parts to this, I think:
1) When the value is clearly expected to always be a URL (e.g the schema.org processing, where the next thing that happens with the return from the function is an HTTP request with Maremma), the code should change to use normalize_url
instead of normalize_id
2) For things such as that test, where the ID may or may not be a URL, and should return nil
if the ID is a URL but not HTTP or HTTPS, then a different change to normalize_id
, where this code:
https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/lib/bolognese/utils.rb#L621-L623
is split out into a slightly different logic flow:
nil
When attempting to implement this logic flow:
Attempt to parse URL
If valid URL, check the scheme
If not http or https, return
nil
If http or https, proceed to URL cleanup
If not valid URL, return as is
. . . it seems the following code gets flummoxed since it depends on input
being nil if input
is not a valid URL:
This results in a few tests failing, like this one:
Rather than refactoring metadata
, I opted for a less elegant normalization step in datacite_reader
. It doesn't accomplish
- If not http or https, return
nil
. . . but it does normalize http and https URLs. @digitaldogsbody , do you think more work needs to be done here?
. @digitaldogsbody , do you think more work needs to be done here?
I think this is a good solution for now. My only outstanding question would be about no longer passing scheme_uri
when it is present, as it did with the original behaviour:
if funder_identifier_type != "Other"
funder_identifier = !funder_identifier.to_s.start_with?("https://","http://") && scheme_uri.present? ? normalize_id(scheme_uri + funder_identifier) : normalize_id(funder_identifier)
You might consider adding that logic back into the code in your else
block here: https://github.com/datacite/bolognese/pull/143/files#diff-4cfebbeb62130d0a784d3dcc28af41aed547f5bd4da1ec9e4600419e0c4638b0R167, just in case it is depended on somewhere that isn't well tested.
My only outstanding question would be about no longer passing
scheme_uri
when it is present, as it did with the original behaviour:if funder_identifier_type != "Other" funder_identifier = !funder_identifier.to_s.start_with?("https://","http://") && scheme_uri.present? ? normalize_id(scheme_uri + funder_identifier) : normalize_id(funder_identifier)
This scheme_uri
behavior is considered incorrect or a bug at this point. We're also removing it from name_identifier
here: https://github.com/datacite/bolognese/pull/137
Purpose
When reading DataCite XML, bolognese would not return the funder identifier in a funding reference if the funder identifier did not originally appear as a URL.
closes: #142
Approach
When processing
funding_references
, a normalization step would attempt to render thefunder_identifier
as a URL:https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/lib/bolognese/readers/datacite_reader.rb#L161
If the schemeURI was present,
normalize_id
would be called on a concatenation of the schemeURI and funder identifier. Otherwise,normalize_id
would be called on the funder identifier. Ifnormalize_id
could not render either as a URI,funder_identifier
would become nil:https://github.com/datacite/bolognese/blob/dc8170d2f44271bc20a9c649c10130422007a4bf/lib/bolognese/utils.rb#L614
This PR normalizes Crossref Funder IDs and RORs but performs no normalization steps otherwise, avoiding a
funder_identifier
nil value. It also adds the optionalschemeUri
attribute to the rendered funding reference, which was originally not included.Open Questions and Pre-Merge TODOs
Types of changes
[x] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)
Reviewer, please remember our guidelines: