EcologicalTraitData / traitdataform

A package to manage and compile functional trait data into predefined templates
https://ecologicaltraitdata.github.io/traitdataform/

refine warnings on get_gbif_taxonomy #39

Open mikeroswell opened 4 years ago

mikeroswell commented 4 years ago

Loving get_gbif_taxonomy so far.

I just ran it on a list my colleagues maintain of about 20,000 "valid" names of hymenopteran species. About 16 came back with the warning " Selected first of multiple equally ranked concepts!". For the majority of these, scientificName == scientificNameStd. However, the ones where the two differ (at the threshold I used) seem likely to be mismatched. It would be super helpful, I think, to give these two cases different warnings: when going through and manually checking results, it's great to have warnings in cases where the automation probably worked, but it's also nice to be able to focus easily on the ones most likely to be a problem.

Thanks!

mikeroswell commented 4 years ago

For example, here is a place I would want a "louder" warning: I searched for "Leioproctus carinatus" (a valid species name) and the selected species was "capillatus", which I believe is a separate but also valid species name. This is different from "Hoplitis rubicrus", which matches with itself... the warning is still helpful, but I'm less concerned when a species name is matched with itself than when a totally different species is matched :-)
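
A minimal sketch of the triage described above, written in Python against hand-written rows rather than real package output. The column names scientificName, scientificNameStd and warnings come from this thread; the `triage` helper, its labels, and the full standardized name in the first row are assumptions made for illustration only. It also assumes the warnings field may hold more than one message, so it uses a substring test.

```python
# Warning text as quoted in this thread (surrounding whitespace ignored).
MULTI = "Selected first of multiple equally ranked concepts!"

def triage(row):
    """Label a result row by how much manual attention it probably needs.

    Hypothetical helper, not part of traitdataform.
    """
    warning = (row.get("warnings") or "").strip()
    if MULTI not in warning:
        return "other"
    given = row["scientificName"].strip().lower()
    matched = (row.get("scientificNameStd") or "").strip().lower()
    # Same name matched with itself: warning is informative but low risk.
    # Different name selected: the case that deserves a "louder" warning.
    return "self-match" if given == matched else "different-name"

# Illustrative rows only; the standardized names are assumed, not queried.
rows = [
    {"scientificName": "Leioproctus carinatus",
     "scientificNameStd": "Leioproctus capillatus",
     "warnings": " Selected first of multiple equally ranked concepts!"},
    {"scientificName": "Hoplitis rubicrus",
     "scientificNameStd": "Hoplitis rubicrus",
     "warnings": " Selected first of multiple equally ranked concepts!"},
]

for row in rows:
    print(row["scientificName"], "->", triage(row))
```

Sorting or filtering on the "different-name" label would put the likely mismatches at the top of a manual check.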

fdschneider commented 4 years ago

Thanks for the feedback. More specific and louder warnings, as well as ways to interact directly with the function, would be great. As you probably figured, the problem is caused by fuzzy matching producing a match with the wrong valid taxon. With the option fuzzy = FALSE, the mismatch should be avoided.

I considered switching off fuzzy matching by default, but misspellings are very frequent in data and would not be addressed otherwise (see #38).

The function get_gbif_taxonomy() is essential for the package, as it provides a quick mapping of taxa to the GBIF Backbone Taxonomy following logical rules of thumb. Figuring out all possible matching errors is tedious, so please keep posting those here. Unfortunately, improving the function is not the core focus of the work right now, as I'm also hoping for a more general way of implementing taxon mapping (e.g. also providing a choice of the reference taxonomy).

mikeroswell commented 4 years ago

Oh, I definitely think fuzzy matching is desirable (that's why I'm using your function!). I just think it would be nice to distinguish the two cases: when I do this in a minute with 600,000 rows of hand-entered data, I'm going to have a lot of warnings, but I really want to focus on the ones that are likely to mess me up. Thanks!

mikeroswell commented 4 years ago

Another place the warnings could be clearer: my mix of valid and invalid names is returning two different warnings: warnings == " No match! Check spelling or lower confidence threshold!" and warnings == "No matching species concept! ". It would be helpful to know what triggers one warning vs. the other... for me these warnings are guides to troubleshooting mismatches, not simply criteria for filtering. Thanks!
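
As a stopgap while the two messages remain undocumented, the warnings can at least be grouped for inspection. The sketch below is in Python and assumes the warnings column has been exported as a list of strings; `tally_warnings` is a hypothetical helper, not a package function. It strips whitespace first, because the two messages quoted above carry a leading and a trailing space respectively, which would break an exact == comparison. It does not explain what triggers either message.

```python
from collections import Counter

def tally_warnings(warnings):
    """Count warning messages after trimming surrounding whitespace.

    Empty and missing entries are skipped.
    """
    cleaned = (w.strip() for w in warnings if w and w.strip())
    return Counter(cleaned)

# Hand-written sample values, using the two messages quoted in this thread.
sample = [
    " No match! Check spelling or lower confidence threshold!",
    "No matching species concept! ",
    "No matching species concept!",
    None,
    "",
]

for message, count in tally_warnings(sample).most_common():
    print(count, message)
```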