NCEAS / metadig-checks

MetaDIG suites and checks for data and metadata improvement and guidance.
Apache License 2.0

resource.projectTitle.controlled #420

Closed gothub closed 1 year ago

gothub commented 3 years ago

Description

Check whether the DOE project name associated with the data package comes from a controlled list.

Priority

Issues

Procedure

Get project name from metadata and check that the project name is included within a controlled list. Initially this controlled list will be provided from https://data.ess-dive.lbl.gov/js/themes/ess-dive/data/projects.json, but in the future the controlled list check could be performed by calling an API provided by ESS-DIVE.

The check will fail if an exact match is not found for the project name (title).
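
A minimal sketch of the exact-match rule, assuming the controlled titles have already been read from projects.json into a character vector. The function name `isControlledTitle` and the two sample titles are illustrative, not taken from the check itself.

```r
# Hypothetical helper: TRUE only when `title` matches a controlled title exactly.
# `controlledTitles` stands in for the projectTitle values read from projects.json.
isControlledTitle <- function(title, controlledTitles) {
  length(title) == 1 && !is.na(title) && nzchar(title) && title %in% controlledTitles
}

controlledTitles <- c(
  "ExaSheds",
  "Free Air CO2 Enrichment Model Data Synthesis (FACE-MDS)"
)

isControlledTitle("ExaSheds", controlledTitles)  # TRUE
isControlledTitle("exasheds", controlledTitles)  # FALSE: the comparison is case-sensitive
```

Because `%in%` compares whole strings, differences in case, punctuation or trailing text all count as a failure, which is what "exact match" implies here.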

Requested check messages

On failure: "Warning. The DOE project name listed is not from the controlled list of projects. When entering project name, use the autocomplete feature to choose from the existing projects. If you can not find your project name, try entering the PI name."

gothub commented 3 years ago

Note that this check is referenced in this ESS-DIVE issue. It is also mentioned in this ESS-DIVE issue.

gothub commented 3 years ago

This check has been renamed to resource.projectTitle.controlled, as the XML element being tested is /eml/dataset/project/title.

gothub commented 2 years ago

@vchendrix The R package you suggested appears to work well for fuzzy matches between /eml/dataset/project/title and the projectTitle entries in https://data.ess-dive.lbl.gov/js/themes/ess-dive/data/projects.json.

The check will find the closest match. If the match is exact, the output message will be:

The project title was found in the list of known project titles.

If an exact match was not found, the output message will be:

The project title was not found in the list of known project titles.
The closest match was "<insert closest match here>".
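
The two messages above could be produced along these lines. This is only a sketch: `closestTitleMessage` is a hypothetical name, and base R's `utils::adist` (generalized Levenshtein distance) is used in place of the `stringdist` package so the example has no dependencies.

```r
# Hypothetical helper that returns one of the two output messages.
closestTitleMessage <- function(title, controlledTitles) {
  if (title %in% controlledTitles) {
    return("The project title was found in the list of known project titles.")
  }
  # Distance from the submitted title to every controlled title
  dists <- utils::adist(title, controlledTitles)[1, ]
  closest <- controlledTitles[which.min(dists)]
  sprintf(
    "The project title was not found in the list of known project titles.\nThe closest match was \"%s\".",
    closest
  )
}

controlledTitles <- c(
  "ExaSheds",
  "Free Air CO2 Enrichment Model Data Synthesis (FACE-MDS)"
)
cat(closestTitleMessage("Free-Air CO2 Enrichment Model Data Synthesis", controlledTitles), "\n")
```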

The closest match found can be wildly different from the submitted title, depending on the length of the title and whether anything actually similar exists in the list. Here are a few titles obtained from ESS-DIVE Solr, each with its closest match from the controlled list and the calculated values:

Project Title: WHONDRS
Closest dist: 8, closest match: ExaSheds
percent diff: 1.142857

Project Title: Trace Metal Dynamics and Limitations on Biogeochemical Cycling in Wetland Soils and Hyporheic Zones, PI Jeffrey G. Catalano
Closest dist: 24, closest match: Trace Metal Dynamics and Limitations on Biogeochemical Cycling in Wetland Soils and Hyporheic Zones
percent diff: 0.195122

Project Title: SPRUCE
Closest dist: 8, closest match: ExaSheds
percent diff: 1.333333

Project Title: Free-Air CO2 Enrichment Model Data Synthesis
Closest dist: 12, closest match: Free Air CO2 Enrichment Model Data Synthesis (FACE-MDS)
percent diff: 0.272727

Please let me know if this is sufficient, or if you have any ideas on what to print or calculate when there is not a close match.

The values above are calculated as

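    library(stringdist)
    # Levenshtein ("lv") distance, then the distance relative to the submitted title's length
    dist <- stringdist(projectTitles[[iproj]], controlledProjectTitles[[ictrl]], method = "lv")
    percentDiff <- dist / nchar(projectTitles[[iproj]])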
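
As a worked example of the calculation, the sketch below recomputes two of the rows listed earlier. It assumes base R's `utils::adist` gives the same unit-cost Levenshtein distance as `stringdist(..., method = "lv")`; `levPercentDiff` is a hypothetical name.

```r
# Hypothetical helper: Levenshtein distance and distance / length of the submitted title.
levPercentDiff <- function(title, candidate) {
  dist <- drop(utils::adist(title, candidate))
  c(dist = dist, percentDiff = dist / nchar(title))
}

# 7 substitutions + 1 insertion = 8 edits; 8 / 7 characters is about 1.142857
levPercentDiff("WHONDRS", "ExaSheds")

# 1 substitution ("-" to " ") + 11 inserted characters = 12 edits; 12 / 44 is about 0.272727
levPercentDiff(
  "Free-Air CO2 Enrichment Model Data Synthesis",
  "Free Air CO2 Enrichment Model Data Synthesis (FACE-MDS)"
)
```

Because the distance is divided by the length of the submitted title only, a short title compared against a longer candidate can give a value above 1, as the WHONDRS row shows.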

JEDamerow commented 2 years ago

We may want to have a cutoff on the percent diff. Like if it is >= 1, then there is no useful match? Otherwise, this looks good to me!

mbjones commented 2 years ago

👍 Agreed on the percent match cutoff. That makes sense if we find a good cutoff level. Something like 0.5 or 0.75 might produce fewer spurious suggestions.

JEDamerow commented 2 years ago

Yes, 0.5 - 0.75 cutoffs seem more reasonable. Can you try the 0.5 or so cutoff and see what happens there?
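
A rough sketch of how such a cutoff could be applied. The name `classifyMatch`, the three status labels and the choice of a strict `<` at the boundary are assumptions; `utils::adist` again stands in for `stringdist`.

```r
# Hypothetical classifier: "exact", "close" (below the cutoff) or "none".
classifyMatch <- function(title, controlledTitles, tolerance = 0.5) {
  dists <- utils::adist(title, controlledTitles)[1, ]
  best <- which.min(dists)
  percentDiff <- dists[best] / nchar(title)
  status <- if (dists[best] == 0) {
    "exact"
  } else if (percentDiff < tolerance) {
    "close"  # near enough to suggest as the intended title
  } else {
    "none"   # the nearest title is too different to be a useful suggestion
  }
  list(status = status, closest = controlledTitles[best], percentDiff = percentDiff)
}

controlledTitles <- c(
  "ExaSheds",
  "Free Air CO2 Enrichment Model Data Synthesis (FACE-MDS)"
)
classifyMatch("WHONDRS", controlledTitles)$status                                       # "none"
classifyMatch("Free-Air CO2 Enrichment Model Data Synthesis", controlledTitles)$status  # "close"
```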

gothub commented 2 years ago

Here are some results. A tolerance of 0.70 seems good: there are fewer 'no matches' than at the lower tolerances, and few of the 'close matches' are obviously not the intended title, at least for the number of titles currently present. This fuzzy matching is character-based, and I'm realizing it would be much better if it were token-based. Anyway, here are the counts for the different tolerances (cutoffs):

tolerance: .75 Exact matches: 39 Close matches (within tolerance of 0.750000 percent): 65 No matches: 18

tolerance: .70 Exact matches: 39 Close matches (within tolerance of 0.700000 percent): 41 No matches: 42

tolerance: .60 Exact matches: 39 Close matches (within tolerance of 0.600000 percent): 25 No matches: 58

tolerance: .50 Exact matches: 39 Close matches (within tolerance of 0.500000 percent): 13 No matches: 70
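
Counts of this kind could be tallied roughly as follows. This is not the script that produced the numbers above: `tallyMatches` is a hypothetical name, the four sample titles are the ones quoted earlier in the thread, and the two-entry controlled list is a small stand-in for projects.json.

```r
# Hypothetical tally of exact / close / no-match titles at one tolerance.
tallyMatches <- function(titles, controlledTitles, tolerance) {
  # Smallest distance to any controlled title, relative to the title's length
  percentDiffs <- vapply(titles, function(title) {
    min(utils::adist(title, controlledTitles)) / nchar(title)
  }, numeric(1))
  c(
    exact = sum(percentDiffs == 0),
    close = sum(percentDiffs > 0 & percentDiffs < tolerance),
    none  = sum(percentDiffs >= tolerance)
  )
}

titles <- c("ExaSheds", "WHONDRS", "SPRUCE", "Free-Air CO2 Enrichment Model Data Synthesis")
controlledTitles <- c(
  "ExaSheds",
  "Free Air CO2 Enrichment Model Data Synthesis (FACE-MDS)"
)
for (tolerance in c(0.75, 0.70, 0.60, 0.50)) {
  print(c(tolerance = tolerance, tallyMatches(titles, controlledTitles, tolerance)))
}
```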

I can provide the raw results if desired.

mbjones commented 2 years ago

I think the results would be helpful to see for those four cases whether several people think the "close match" list is reasonable or includes outliers (e.g. has false positives), and whether the "No match" list contains titles that should have been a close match (e.g, that are false negatives). The threshold is really about minimizing both false negatives and false positives.
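
One way to make that comparison concrete, sketched under assumptions: the `percentDiff` values are the four quoted earlier in the thread, while the `intended` column is only an illustrative reading of whether each closest match was the title the submitter meant. `countErrors` is a hypothetical name.

```r
# percentDiff values quoted earlier in the thread; `intended` is an assumed reviewer judgment.
reviewed <- data.frame(
  percentDiff = c(1.142857, 0.195122, 1.333333, 0.272727),
  intended    = c(FALSE, TRUE, FALSE, TRUE)
)

# Hypothetical helper: error counts if suggestions are accepted below `tolerance`.
countErrors <- function(reviewed, tolerance) {
  accepted <- reviewed$percentDiff < tolerance
  c(
    falsePositives = sum(accepted & !reviewed$intended),  # suggested, but not the intended title
    falseNegatives = sum(!accepted & reviewed$intended)   # rejected, but was the intended title
  )
}

# One column per tolerance, in the order given
sapply(c(0.50, 0.60, 0.70, 0.75), function(tolerance) countErrors(reviewed, tolerance))
```

With only four rows every tolerance in the 0.5 to 0.75 range gives zero errors of either kind, so a meaningful comparison would need the reviewed lists for all of the titles.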

tedhabermann commented 2 years ago

Joan et al.,

I have used some fuzzy matching in searching affiliation strings and found this page helpful in explaining some of the different approaches. Should be transferable to R.

I typically use multiple tests and a threshold that can be adjusted…

https://www.datacamp.com/community/tutorials/fuzzy-string-python

Ted
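
A sketch of what "multiple tests and an adjustable threshold" might look like in R, combining the character-based measure from earlier in the thread with a token-based one. Everything here is an assumption for illustration: the function names, the use of Jaccard similarity over lowercased word tokens, the default thresholds, and the rule that either test passing is enough.

```r
# Split a title into unique lowercase alphanumeric tokens.
tokenize <- function(title) {
  tokens <- unlist(strsplit(tolower(title), "[^a-z0-9]+"))
  unique(tokens[nzchar(tokens)])
}

# Jaccard similarity: shared tokens / all distinct tokens across both titles.
tokenJaccard <- function(a, b) {
  allTokens <- union(tokenize(a), tokenize(b))
  if (length(allTokens) == 0) return(0)
  length(intersect(tokenize(a), tokenize(b))) / length(allTokens)
}

# Hypothetical combined test with adjustable thresholds.
isCloseMatch <- function(a, b, maxPercentDiff = 0.7, minJaccard = 0.5) {
  percentDiff <- drop(utils::adist(a, b)) / nchar(a)
  percentDiff < maxPercentDiff || tokenJaccard(a, b) >= minJaccard
}

tokenJaccard(
  "Free-Air CO2 Enrichment Model Data Synthesis",
  "Free Air CO2 Enrichment Model Data Synthesis (FACE-MDS)"
)                                     # 7 shared tokens out of 9 distinct tokens
isCloseMatch("WHONDRS", "ExaSheds")   # FALSE: no shared tokens, percentDiff above 1
```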

On Dec 8, 2021, at 4:37 PM, JEDamerow @.***> wrote:

We may want to have a cut off on the percent diff. Like if it is => 1, then there is no useful match?


gothub commented 2 years ago

For those of you who would like to view the sample results from the test script, here they are.

The diffTolerance*.lis files are the debug output of the R program runs that compare project titles obtained from the ESS-DIVE Solr service to the titles in the controlled list. The controlledProjectTitles.lis file is the list of project titles obtained from the projects.json controlled list.

testFiles.zip

JEDamerow commented 2 years ago

We decided to stick with exact matches for now, and to add instructions on how to find the appropriate project name in our UI. We do not have a suitable public-facing project list other than the autocomplete feature in our data submission UI.

@gothub The message for now is: On failure: "Warning. The DOE project name listed is not from the controlled list of projects. When entering project name, use the autocomplete feature to choose from the existing projects. If you can not find your project name, try entering the PI name."

gothub commented 2 years ago

Initial version in commit 728700101d3bdc360601aeeb952743b355e5f036

gothub commented 2 years ago

Regarding the ESS-DIVE project list located at https://data.ess-dive.lbl.gov/js/themes/ess-dive/data/projects.json: we should add a mechanism to refresh the local copy of this file (the one held by metadig-engine) periodically, for example, daily. This may have to be implemented at the metadig-engine level and not the check level.
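
A minimal sketch of an age-based refresh, written in R only to illustrate the idea; metadig-engine may well handle this differently. The names `needsRefresh` and `refreshProjectList` and the 24-hour default are assumptions.

```r
# Hypothetical check: is the local copy missing or older than `maxAgeHours`?
needsRefresh <- function(path, maxAgeHours = 24, now = Sys.time()) {
  if (!file.exists(path)) return(TRUE)
  ageHours <- as.numeric(difftime(now, file.mtime(path), units = "hours"))
  ageHours > maxAgeHours
}

# Hypothetical refresh: download to a temporary file first, so a failed
# download never overwrites the existing local copy.
refreshProjectList <- function(url, path, maxAgeHours = 24) {
  if (needsRefresh(path, maxAgeHours)) {
    tmp <- tempfile(fileext = ".json")
    status <- tryCatch(
      utils::download.file(url, tmp, quiet = TRUE),
      error = function(e) 1L,
      warning = function(w) 1L
    )
    if (isTRUE(status == 0)) file.copy(tmp, path, overwrite = TRUE)
  }
  invisible(path)
}
```

A check or the engine could call `refreshProjectList(url, path)` before reading the list; when the copy is fresh the call does nothing, and when the download fails the stale copy is kept.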

gothub commented 2 years ago

This check is now in the ESS-DIVE 1.1.0 suite.

jeanetteclark commented 1 year ago

As discussed in #440, it might be best to pull from an external source as opposed to metadig/data so that the file can more easily be updated.

@vchendrix or @mburrus can you point me to a stable location of that file hosted somewhere? Is https://data.ess-dive.lbl.gov/js/themes/ess-dive/data/projects.json best? (Note: that URL doesn't actually resolve for me)

mbjones commented 1 year ago

@jeanetteclark See issue #438, which provides the new request URI to update the service. I think we should close this issue, and plan the new work over in #438 and #440.