elifesciences / eLife-JATS-schematron

Schematron for all JATS eLife content
MIT License

Match Funders in Acknowledgements to Open Funder Registry #80

Closed Melissa37 closed 4 years ago

Melissa37 commented 4 years ago

Background

Authors have traditionally listed their funding in the acknowledgements; a separate funding section is relatively new, and not all publishers have one yet. As a result, our authors sometimes do not add their funding details to the section provided in EJP, or they fill in the funding section and still retain funding information in the acknowledgements.

Describe what you would like tested

The Open Funder Registry is found on GitLab: https://github.com/Crossref/open-funder-registry. It is a downloadable file that is updated on an ad hoc basis, so the repo will need to be followed to keep any local copy up to date.

Elements affected

<ack id="ack">

<funding-group>
                <award-group id="fund1">
                    <funding-source>
                        <institution-wrap><institution-id institution-id-type="FundRef">https://dx.doi.org/10.13039/100000011</institution-id>
                        <institution>Howard Hughes Medical Institute</institution>
...

Suggested schematron message

A funder in the open funder registry is mentioned in the acknowledgments but not listed in the funding section. Please check.

Suggested role (warning or error)

warning (because it might be appropriate that they are not listed as a direct funder). eLife will have to provide examples in the Wiki of when not to add the funder to the funder list.

Stage

pre-edit

Example

None as yet

fred-atherden commented 4 years ago

Note to self:

The following XQuery returns a list of funder names:

let $fundref := fetch:xml('https://gitlab.com/crossref/open_funder_registry/raw/master/registry.rdf')

for $x in $fundref//*:Concept
return $x//*:literalForm/data()
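For anyone without BaseX to hand, roughly the same extraction can be sketched in Python with the standard library. The inline `SAMPLE_RDF` is a hypothetical, cut-down stand-in for registry.rdf, and its SKOS-XL layout (`Concept`, `prefLabel`, `altLabel`, `literalForm`) is an assumption about the file's structure rather than something confirmed in this thread. Like the XQuery above, it matches on local names only and returns every `literalForm` under each `Concept`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down stand-in for registry.rdf (assumed SKOS-XL layout).
SAMPLE_RDF = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#"
    xmlns:skosxl="http://www.w3.org/2008/05/skos-xl#">
  <skos:Concept rdf:about="http://dx.doi.org/10.13039/100000001">
    <skosxl:prefLabel>
      <skosxl:Label>
        <skosxl:literalForm xml:lang="en">National Science Foundation</skosxl:literalForm>
      </skosxl:Label>
    </skosxl:prefLabel>
    <skosxl:altLabel>
      <skosxl:Label>
        <skosxl:literalForm xml:lang="en">NSF</skosxl:literalForm>
      </skosxl:Label>
    </skosxl:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://dx.doi.org/10.13039/100000011">
    <skosxl:prefLabel>
      <skosxl:Label>
        <skosxl:literalForm xml:lang="en">Howard Hughes Medical Institute</skosxl:literalForm>
      </skosxl:Label>
    </skosxl:prefLabel>
  </skos:Concept>
</rdf:RDF>"""


def local_name(element):
    """Return the tag without its namespace, mirroring the XQuery *: wildcard."""
    return element.tag.rsplit("}", 1)[-1]


def funder_names(rdf_text):
    """Collect the text of every literalForm found under a Concept element."""
    root = ET.fromstring(rdf_text)
    names = []
    for concept in root.iter():
        if local_name(concept) != "Concept":
            continue
        for node in concept.iter():
            if local_name(node) == "literalForm" and node.text:
                names.append(node.text.strip())
    return names


if __name__ == "__main__":
    for name in funder_names(SAMPLE_RDF):
        print(name)
```

Note that the result mixes preferred and alternative labels, because nothing in the query distinguishes the two.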

fred-atherden commented 4 years ago

The list of funders is so massive that I don't think Schematron is the correct way for this to be checked. Validation would take so long that it would become unusable.

Going to explore using basex instead and integrating this with the basex validation module that we are testing.

fred-atherden commented 4 years ago

Related to this but not addressing the specific problem, I suggest that we need an error test to identify a scenario in which there are two (or more) funding entries with the same funder, but only one has a fundref id.

Anecdotally, I've seen this in production myself a couple of times, so a test for it should be useful and go some (small) way towards mitigating the problem here.
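A minimal sketch of that duplicate-funder check, written in Python rather than Schematron so that it can be run in isolation. The `SAMPLE_FUNDING` fragment is hypothetical: it completes the truncated `funding-group` example from the issue description with assumed closing tags and adds a second, id-less entry for the same funder. The function name and the returned shape are illustrative only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical funding-group: the same funder appears twice, once without an id.
SAMPLE_FUNDING = """<funding-group>
  <award-group id="fund1">
    <funding-source>
      <institution-wrap>
        <institution-id institution-id-type="FundRef">https://dx.doi.org/10.13039/100000011</institution-id>
        <institution>Howard Hughes Medical Institute</institution>
      </institution-wrap>
    </funding-source>
  </award-group>
  <award-group id="fund2">
    <funding-source>
      <institution-wrap>
        <institution>Howard Hughes Medical Institute</institution>
      </institution-wrap>
    </funding-source>
  </award-group>
</funding-group>"""


def funders_missing_an_id(xml_text):
    """Map funder name -> award-group ids that lack a FundRef id,
    but only for funders that also have an entry with an id."""
    root = ET.fromstring(xml_text)
    seen = defaultdict(lambda: {"with_id": [], "without_id": []})
    for group in root.iter("award-group"):
        wrap = group.find(".//institution-wrap")
        if wrap is None:
            continue
        name = (wrap.findtext("institution") or "").strip()
        if not name:
            continue
        has_id = wrap.find("institution-id") is not None
        seen[name]["with_id" if has_id else "without_id"].append(group.get("id"))
    return {
        name: entry["without_id"]
        for name, entry in seen.items()
        if entry["with_id"] and entry["without_id"]
    }


if __name__ == "__main__":
    print(funders_missing_an_id(SAMPLE_FUNDING))
```

Funders that never carry an id are deliberately not reported, since the scenario described is specifically a mix of entries with and without one.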

fred-atherden commented 4 years ago

Added a test. This checks for the presence of the preferred label for each funder in Fundref and fires if that funder's DOI is not in the funding-group.

Using regex in this context is problematic - so much memory is used that the whole schematron becomes unusable - therefore the function contains() is used. This means the check is quick but it carries with it limitations, which I have listed below.
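The logic described here can be illustrated outside Schematron with a short Python sketch. This is not the test that was added: `PREFERRED_LABELS` is a hypothetical two-entry lookup standing in for the full registry, and `unlisted_funders` is an illustrative name. The substring test (`in`) plays the role of `contains()`.

```python
# Hypothetical lookup: preferred label -> funder DOI (a stand-in for the registry).
PREFERRED_LABELS = {
    "National Science Foundation": "10.13039/100000001",
    "Howard Hughes Medical Institute": "10.13039/100000011",
}


def unlisted_funders(ack_text, funding_ids):
    """Return preferred labels mentioned in the acknowledgements whose DOI
    does not appear among the funding-group institution ids."""
    flagged = []
    for label, doi in PREFERRED_LABELS.items():
        mentioned = label in ack_text  # plain, case-sensitive substring match
        listed = any(fid.strip().endswith(doi) for fid in funding_ids)
        if mentioned and not listed:
            flagged.append(label)
    return flagged


if __name__ == "__main__":
    ack = ("We thank the National Science Foundation and the "
           "Howard Hughes Medical Institute for support.")
    ids = ["https://dx.doi.org/10.13039/100000011"]
    print(unlisted_funders(ack, ids))
```

Because the match is a literal substring comparison, any difference in punctuation, spacing or casing between the acknowledgements and the label means the funder is not detected.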

I'm going to close this ticket for now. We can re-open if there's a need to refine this test.

Labels

Funders have numerous different names. These are defined in that file as preferred or alternative labels. This test checks only for the presence of preferred labels, e.g. National Science Foundation, not NSF. We could add alternative labels in manually if needed, but allowing all variants leads to this being flagged far too often to be useful (some alternative labels are 'as', 'why', etc.).
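To show why short alternative labels are a problem for a substring check, here is a small Python sketch. The acknowledgement sentence is invented for illustration; the labels 'as' and 'why' are the ones mentioned above.

```python
def labels_found(text, labels):
    """Return every label that occurs anywhere in the text as a substring."""
    return [label for label in labels if label in text]


# Invented acknowledgement that names no funder at all.
ACK = "We thank J Smith for technical assistance and for explaining why the assay failed."

LABELS = ["National Science Foundation", "NSF", "as", "why"]

if __name__ == "__main__":
    print(labels_found(ACK, LABELS))
```

Here 'as' matches inside "assistance" and "assay", and 'why' matches as an ordinary word, so both would be reported even though no funder is mentioned.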

House style

It's reliant on the implementation of our house style in the acknowledgements (removal of full stops and unnecessary spaces from funder names). For example,

W. E. B. Du Bois Institute for African and African American Research, Harvard University

wouldn't flag but

WEB Du Bois Institute for African and African American Research, Harvard University

would.

Casing

Casing has to be the same as it is in fundref, i.e. National Science Foundation instead of National science foundation. Again, this is because if casing were ignored, the rule would fire so often that its usefulness would deteriorate.