NCEAS / metadig-checks

MetaDIG suites and checks for data and metadata improvement and guidance.
Apache License 2.0

resource.awardFunderName.controlled #421

Closed gothub closed 2 years ago

gothub commented 2 years ago

Description

Check if an award funder name has been specified and is contained within a controlled list.

Priority

Issues

Procedure

Obtain the funder name from the metadata and search for it in the Crossref Funder API, which is described here.
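A minimal sketch of that lookup in Python, assuming the Crossref REST API funders endpoint and its query parameter. The helper names and the sample response are illustrative and are not the check's actual implementation. The matching step is kept separate from the HTTP call so it can be run against an already-parsed response.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_FUNDERS_URL = "https://api.crossref.org/funders"  # Crossref REST API endpoint


def funder_query_url(name):
    """Build the search URL for a funder name (hypothetical helper)."""
    return CROSSREF_FUNDERS_URL + "?" + urllib.parse.urlencode({"query": name})


def fetch_funders(name, timeout=10):
    """Call the API and return the parsed JSON response (needs network access)."""
    with urllib.request.urlopen(funder_query_url(name), timeout=timeout) as resp:
        return json.load(resp)


def name_in_response(name, response):
    """True if `name` equals a returned funder's name or alternate name, ignoring case."""
    wanted = name.strip().casefold()
    for item in response.get("message", {}).get("items", []):
        candidates = [item.get("name", "")] + list(item.get("alt-names", []))
        if any(wanted == c.strip().casefold() for c in candidates):
            return True
    return False


# Illustrative response fragment shaped like a funder list (not real API output).
sample = {"message": {"items": [{"name": "National Science Foundation", "alt-names": ["NSF"]}]}}
```

An exact, case-insensitive comparison is used here because a free-text search can return funders that merely resemble the query.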

gothub commented 2 years ago

Crossref provides releases of their funder list, which include an RDF file and a CSV file. The current release includes 29,743 funders.

Should the check read from a downloaded file that is updated periodically, or call the Crossref funder api?
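For comparison, the downloaded-file option could look roughly like the Python sketch below. It assumes a CSV with a header column called name; the actual column names in the Crossref release would need to be confirmed, and the rows shown are stand-ins in that assumed layout.

```python
import csv
import io


def load_funder_names(csv_file, column="name"):
    """Read funder names from an open CSV file into a case-insensitive lookup set.

    `column` is an assumed header name, not taken from the Crossref release.
    """
    reader = csv.DictReader(csv_file)
    return {row[column].strip().casefold() for row in reader if row.get(column)}


def is_controlled_funder(name, known_names):
    """True if the metadata funder name appears in the controlled list."""
    return name.strip().casefold() in known_names


# Stand-in for the downloaded file; rows are illustrative only.
sample_csv = io.StringIO(
    "name\n"
    "National Science Foundation\n"
    "U.S. Department of Energy\n"
)
known = load_funder_names(sample_csv)
```

Loading the list once keeps each check run independent of network availability, at the cost of having to refresh the file when a new release appears.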

gothub commented 2 years ago

@vchendrix @JEDamerow I've only checked a few ESS-DIVE metadata records, and it looks like the <funding> element is used, for example:

    <project>
      ...
      <funding>
        <para>DOE:AC0500OR22725</para>
      </funding>
    </project>

Do you plan to start using the new EML 2.2 structured funding element (<award>) for the funder name, e.g.:

<project>
   ...
   <funding><para>Funding is from a grant from the National Science Foundation.</para></funding>
   <award>
      <funderName>National Science Foundation</funderName>
      <funderIdentifier>https://doi.org/10.13039/00000001</funderIdentifier>
      <awardNumber>1546024</awardNumber>
      <title>Scientia Arctica: A Knowledge Archive for Discovery and Reproducible Science in the Arctic</title>
      <awardUrl>https://www.nsf.gov/awardsearch/showAward?AWD_ID=1546024</awardUrl>
   </award>
</project>

Also, the <funding> element in the ESS-DIVE example above appears to contain an award number, not a funder name. Is there a different API/database that I should be searching?
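As a rough illustration of how a check might read the structured EML 2.2 fields, the Python sketch below pulls the funder name and award number out of each award element. The function name is hypothetical and the snippet is an abbreviated copy of the example above.

```python
import xml.etree.ElementTree as ET

EML_SNIPPET = """
<project>
  <funding><para>Funding is from a grant from the National Science Foundation.</para></funding>
  <award>
    <funderName>National Science Foundation</funderName>
    <awardNumber>1546024</awardNumber>
  </award>
</project>
""".strip()


def award_funders(project):
    """Return (funderName, awardNumber) pairs for each award under a project element."""
    pairs = []
    for award in project.iter("award"):
        # findtext returns the default when the child element is absent
        name = (award.findtext("funderName", default="") or "").strip()
        number = (award.findtext("awardNumber", default="") or "").strip()
        pairs.append((name, number))
    return pairs


project = ET.fromstring(EML_SNIPPET)
funders = award_funders(project)
```

A record that only has the narrative funding/para field yields an empty list, which is how the two styles can be told apart.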

mbjones commented 2 years ago

I think shifting to the structured award field would be preferred, as it will allow for much more effective faceted search than the narrative funding/para field. We made a similar switch for other repos, and our display system should be able to show both.

As these are DOE awards, @gothub, I wonder if the DOE award search would work?

https://pamspublic.science.energy.gov/WebPAMSExternal/Interface/Awards/AwardSearchExternal.aspx

gothub commented 2 years ago

@mbjones thx for the DOE award search link.

@vchendrix I tried a couple of award numbers, to see if this is the correct database to use, but didn't get any hits from the above search link. Do I need to have a login to get results? Here are a couple of award numbers I tried, that were obtained from a Solr query:

vchendrix commented 2 years ago

I defer to @JEDamerow on this.

On Wed, Dec 8, 2021 at 1:53 PM Peter Slaughter wrote:

@Val The R package you suggested appears to work well for fuzzy matches between /eml/dataset/project/title and the projectTitle entries in https://data.ess-dive.lbl.gov/js/themes/ess-dive/data/projects.json.

The check will find the closest match. If the match is exact, the output message will be:

The project title was found in the list of known project titles.

If an exact match was not found, the output message will be:

The project title was not found in the list of known project titles. The closest match was "<insert closest match here>".

The closest match found can be wildly different, depending on the length of the title and whether anything similar actually exists. Here are a few titles obtained from ESS-DIVE Solr and the controlled list, along with the calculated values:

Project Title: WHONDRS
Closest dist: 8, closest match: ExaSheds
percent diff: 1.142857

Project Title: Trace Metal Dynamics and Limitations on Biogeochemical Cycling in Wetland Soils and Hyporheic Zones, PI Jeffrey G. Catalano
Closest dist: 24, closest match: Trace Metal Dynamics and Limitations on Biogeochemical Cycling in Wetland Soils and Hyporheic Zones
percent diff: 0.195122

Project Title: SPRUCE
Closest dist: 8, closest match: ExaSheds
percent diff: 1.333333

Project Title: Free-Air CO2 Enrichment Model Data Synthesis
Closest dist: 12, closest match: Free Air CO2 Enrichment Model Data Synthesis (FACE-MDS)
percent diff: 0.272727

Please let me know if this is sufficient, or if you have any ideas on what to print or calculate when there is not a close match.

The values above are calculated as

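library(stringdist)
# Levenshtein distance, then distance as a fraction of the project title length
dist <- stringdist(projectTitles[[iproj]], controlledProjectTitles[[ictrl]], method = "lv")
percentDiff <- dist / nchar(projectTitles[[iproj]])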
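The same calculation can be sketched in Python without the stringdist package. This is an illustrative reimplementation of plain Levenshtein distance, not the code used by the check, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def closest_title(title, controlled_titles):
    """Return (best_match, distance, distance / len(title)) for a project title."""
    best = min(controlled_titles, key=lambda t: levenshtein(title, t))
    dist = levenshtein(title, best)
    return best, dist, dist / len(title)
```

Because the distance is divided by the length of the submitted title only, the ratio can exceed 1 when nothing similar exists, as in the WHONDRS and SPRUCE cases above.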


JEDamerow commented 2 years ago

We decided to make this a simple check for now, just providing a warning if they do not include the following funding source: "U.S. DOE > Office of Science > Biological and Environmental Research (BER)"

gothub commented 2 years ago

@JEDamerow which EML element should the check look for?

Looking at a couple of recent ESS-DIVE datasets:

The funder you mentioned is in:

<associatedParty>
  <organizationName>U.S. DOE > Office of Science > Biological and Environmental Research (BER)</organizationName>
  <userId directory="unknown">http://dx.doi.org/10.13039/100006206</userId>
  <role>fundingOrganization</role>
</associatedParty>

In these datasets, however, the funding is more typically specified in a different element:

<project>
   ...
  <funding>
    <para>DOE:DEAC0500OR22725</para>
  </funding>
  ...
</project>

and

<project>
  ...
  <funding>
    <para>DOE:DEAC0205CH11231 (Lawrence Berkeley National Laboratory)</para>
  </funding>
  ...
</project>

And in EML 2.2.0 the funder can be specified as something like:

<project>
   ...
   <funding><para>Funding is from a grant from the National Science Foundation.</para></funding>
   <award>
      <funderName>National Science Foundation</funderName>
      <funderIdentifier>https://doi.org/10.13039/00000001</funderIdentifier>
      <awardNumber>1546024</awardNumber>
      <title>Scientia Arctica: A Knowledge Archive for Discovery and Reproducible Science in the Arctic</title>
      <awardUrl>https://www.nsf.gov/awardsearch/showAward?AWD_ID=1546024</awardUrl>
   </award>
</project>

So, points to consider:

JEDamerow commented 2 years ago

@gothub Do the EML elements you are referring to come from our existing metadata documents? I think that we would want to check wherever this DOE funding organization is mentioned in our EML documents. This is what it looks like in our UI where people enter the funder, and BER is one of the controlled terms.

[image: funder entry field in the ESS-DIVE UI]

gothub commented 2 years ago

I think that the 'associatedParty' element is being used by ESS-DIVE to store the funder name, but y'all will have to verify that. I'd like to make sure of this. This info can also be stored in the funding element, and with EML 2.2.0 it can also be stored in the <award> element (see https://eml.ecoinformatics.org/whats-new-in-eml-2-2-0.html#structured-funding-information).

If the check will be using associatedParty, I can select only the element(s) with role=fundingOrganization, in case there are now, or will be in the future, associatedParty entries with roles other than funding organization.

So, should I check for associatedParty with role = fundingOrganization?

JEDamerow commented 2 years ago

@vchendrix Do you have any input on the metadata elements that we use to indicate the funder, and how that may differ from how others normally indicate the funder, according to the above? It seems that we use "associated party" instead of the "funding" element? If there is a reason, I don't really mind; we just need to confirm that this is the case, since that is where we would check that the funder includes BER, which is what Charu thought we should do for this check.

gothub commented 2 years ago

Initial version added in commit https://github.com/NCEAS/metadig-checks/commit/28aeb09f9d476534e5498b9c4494e843893da12a

gothub commented 2 years ago

Reopened - will close after @vchendrix comments.

vchendrix commented 2 years ago

@JEDamerow @gothub I just noticed this thread (I was out yesterday). We add the funder information to one or more associatedParty elements in EML. This is because we didn't want to tie the funder to the project, as there may be multiple funders and they may or may not be associated with the project.

<associatedParty id="6298334361820877">
<organizationName>U.S. DOE &#x3E; Office of Science &#x3E; Early Career Research Program</organizationName><role>fundingOrganization</role>
</associatedParty>

The contract numbers are added to <funding/> in the <project/> element.

<project>
<title>Early Career Research Program: Watershed Perturbation-Response Traits Derived Through Ecological Theory - Worldwide Hydrobiogeochemistry Observation Network for Dynamic River Systems (WHONDRS)</title>
<personnel id="4080305509860062">
<organizationName>Early Career Research Program: Watershed Perturbation-Response Traits Derived Through Ecological Theory - Worldwide Hydrobiogeochemistry Observation Network for Dynamic River Systems (WHONDRS)</organizationName>
<role>metadataProvider</role></personnel>
<funding>
<para>DOE:DOE Award #74193</para></funding>
</project>

Hope that helps
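For reference, reading the contract numbers described above might look like the following Python sketch. It simply collects the text of each para under project/funding; the function name is hypothetical and the sample document is trimmed from the example above.

```python
import xml.etree.ElementTree as ET


def funding_paragraphs(root):
    """Return the text of every <para> under project/funding, stripped of whitespace."""
    texts = []
    for project in root.iter("project"):
        for para in project.findall("funding/para"):
            if para.text and para.text.strip():
                texts.append(para.text.strip())
    return texts


doc = ET.fromstring(
    "<dataset><project><title>Example</title>"
    "<funding><para>DOE:DOE Award #74193</para></funding>"
    "</project></dataset>"
)
contract_entries = funding_paragraphs(doc)
```

The entries are returned verbatim, since the thread shows more than one layout for this field.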

gothub commented 2 years ago

The check now properly inspects all associatedParty entries for the approved funding agency entry.
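A minimal Python sketch of the behaviour described in this thread: every associatedParty with role fundingOrganization is inspected, and a warning is produced when the BER entry is missing. This is not the committed check; the status strings, messages, and function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

APPROVED_FUNDER = "U.S. DOE > Office of Science > Biological and Environmental Research (BER)"


def funding_organizations(root):
    """Names of all associatedParty entries whose role is fundingOrganization."""
    names = []
    for party in root.iter("associatedParty"):
        role = (party.findtext("role") or "").strip()
        if role == "fundingOrganization":
            names.append((party.findtext("organizationName") or "").strip())
    return names


def check_funder(root, approved=APPROVED_FUNDER):
    """Return (status, message); a missing approved funder yields a warning, not a failure."""
    if approved in funding_organizations(root):
        return "SUCCESS", "The approved funding organization was found."
    return "WARNING", "The approved funding organization was not found: " + approved
```

Filtering on the role first means an associatedParty that names BER in some other capacity does not satisfy the check.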