Open jhnwllr opened 3 years ago
@yroskov @gdower @NicBailly @olafbanki this should be interesting for the COL gap analysis and prioritisation.
"Rhynchelytrium nepeng" looks like a badly misspelled form of the species name Rhynchelytrum repens (Willd.) C.E.Hubb. from the Poaceae family. It is a synonym of Melinis repens (Willd.) Zizka
It is present in the CoL: https://www.catalogueoflife.org/data/taxon/3ZFLC
Which GBIF data provider (n_publisher) accounts for the 4562 occurrences with this name?
"Aliaria" petiolata (M.Bieb.) Cavara & Grande is again a misspelled name: Alliaria petiolata (M.Bieb.) Cavara & Grande from the Brassicaceae family.
It is present in the CoL: https://www.catalogueoflife.org/data/taxon/BTPF
Iris xiphium "(L.) Dryand. ex Ait." appears in the CoL as Iris xiphium L.: https://www.catalogueoflife.org/data/taxon/6MXYP
"Candolleomyces candolleanus" looks like a synonym for Psathyrella candolleana (Fr.) Maire: https://www.catalogueoflife.org/data/taxon/4NDVN.
This combination is not yet present in the CoL because it was published in 2020.
Expapillata firmatoi (Barretto, Martins & Pellegrino, 1956). The combination is not present in CoL, but the species is present in it as Lutzomyia firmatoi (Barretto, Martins & Pellegrino, 1956): https://www.catalogueoflife.org/data/taxon/3WGXN.
The majority of taxonomic databases in zoology do not collect all subsequent combinations in synonymy. Unfortunately, the combination "Expapillata firmatoi" will not appear in the CoL even after the retirement of CIPA and a new update of Systema Dipterorum.
Callophycus laxus is an algal species from Rhodophyta. Unfortunately, AlgaeBase left CoL in 2013. Before that, this species was present in the CoL: http://www.catalogueoflife.org/annual-checklist/2012/details/species/id/15246
Tropicagama temporalis (Agamidae family). The combination is not present in CoL, but the species is present in it as Gowidon temporalis (Günther, 1867): https://www.catalogueoflife.org/data/taxon/3H4YT
The combination "Tropicagama temporalis" may appear in the CoL with a new ReptileDB update.
Sterile sorediate crust looks to me like a Latin descriptive phrase rather than a scientific name.
Ganglionus catenatus
It seems the genus Ganglionus (2001) from Curculionidae is missing from WTaxa.
"Ganglionus catenatus" might also be a subsequent combination for Cionus catenatus Fairmaire, L., 1897, a species present in the CoL: https://www.catalogueoflife.org/data/taxon/VFG8
Deanemyia samueli (Deane, 1955). The combination is not present in CoL, but the species is present in it as Lutzomyia samueli (Deane, 1955): https://www.catalogueoflife.org/data/taxon/3WH2L
kingdom | v_scientificname | occ_count | n_dataset | n_publisher | the species is present in CoL as: | link |
---|---|---|---|---|---|---|
Plantae | Rhynchelytrium nepeng | 4562 | 2 | 1 | Melinis repens | https://www.catalogueoflife.org/data/taxon/3ZFLC |
Plantae | Aliaria petiolata (M.Bieb.) Cavara & Grande | 1579 | 3 | 1 | Alliaria petiolata | https://www.catalogueoflife.org/data/taxon/BTPF |
Plantae | Iris xiphium (L.) Dryand. ex Ait. | 1202 | 2 | 1 | Iris xiphium | https://www.catalogueoflife.org/data/taxon/6MXYP |
Fungi | Candolleomyces candolleanus | 1150 | 1 | 1 | Psathyrella candolleana | https://www.catalogueoflife.org/data/taxon/4NDVN |
Animalia | Expapillata firmatoi (Barretto, Martins & Pellegrino, 1956) | 1024 | 1 | 1 | Lutzomyia firmatoi | https://www.catalogueoflife.org/data/taxon/3WGXN |
Plantae | Callophycus laxus | 828 | 5 | 5 | until 2013 | http://www.catalogueoflife.org/annual-checklist/2012/details/species/id/15246 |
Animalia | Tropicagama temporalis | 686 | 3 | 3 | Gowidon temporalis | https://www.catalogueoflife.org/data/taxon/3H4YT |
Fungi | Sterile sorediate crust | 604 | 1 | 1 | no | non-sci name (?) |
Animalia | Ganglionus catenatus | 482 | 2 | 2 | Cionus catenatus (?) | https://www.catalogueoflife.org/data/taxon/VFG8 |
Animalia | Deanemyia samueli (Deane, 1955) | 460 | 1 | 1 | Lutzomyia samueli | https://www.catalogueoflife.org/data/taxon/3WH2L |
CONCLUSION
of 10 mismatched names:
Thanks Yuri! Good to know most species do exist in COL. Still, I do wonder what we can do to improve the recall, as Dave would say. Misspellings in occurrence data are something GBIF needs to worry about: either improve the fuzzy matching or report the wrong data to the data publishers.
But with missing synonyms, users coming in with these names (and they appear on thousands of records) have a hard time using COL. This is where the extended COL is really needed.
@ahahn-gbif can we contact https://www.gbif.org/dataset/a720d91c-36d6-45a2-a163-e7145ebb30dd as they seem to have 4561 records with a typo for a grass and are pretty much the only ones that have this wrong name?
@yroskov @mdoering I have not followed the development of the GBIF fuzzy matching for a long time, so I do not know exactly how it works. I once worked for Dave Remsen on understanding why so many fish names were not matching, given that FishBase collates numerous new combinations and misspellings. I can resend you the report. I also started to develop what I called the GSAy matching, which I half published when we created BiOnym for D4Science. I am not sure whether the COL IT group still has meetings, but I could present this method. Basically, instead of applying the fuzzy match to all names at once, the match is done step by step, so that cases of the same nature are solved together. This avoids a number of false positives and probably lets us structure the answer to the provider in a less scary way.
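To make the step-by-step idea concrete, here is a minimal sketch in Python. It is not the GSAy or BiOnym algorithm, and not how GBIF matches names; the function names, the three steps chosen, the use of `difflib` for similarity and the 0.85 threshold are all assumptions for illustration only.

```python
from difflib import SequenceMatcher


def canonical(name):
    """Reduce a name to its first two tokens (genus + epithet), lower-cased."""
    return " ".join(name.split()[:2]).lower()


def stepwise_match(verbatim, checklist, threshold=0.85):
    """Try cheap, safe steps first; fall back to fuzzy matching last.

    Returns (matched_name, step), where step names the stage that succeeded.
    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    # Step 1: exact match on the full string, authorship included.
    if verbatim in checklist:
        return verbatim, "exact"

    # Step 2: exact match on the binomial only, ignoring authorship.
    by_canonical = {canonical(n): n for n in checklist}
    key = canonical(verbatim)
    if key in by_canonical:
        return by_canonical[key], "canonical"

    # Step 3: fuzzy match on the binomial, only for names still unresolved.
    best, best_score = None, 0.0
    for cand_key, cand in by_canonical.items():
        score = SequenceMatcher(None, key, cand_key).ratio()
        if score > best_score:
            best, best_score = cand, score
    if best_score >= threshold:
        return best, "fuzzy"
    return None, "none"


checklist = [
    "Alliaria petiolata (M.Bieb.) Cavara & Grande",
    "Iris xiphium L.",
]
for name in [
    "Iris xiphium L.",
    "Iris xiphium (L.) Dryand. ex Ait.",
    "Aliaria petiolata (M.Bieb.) Cavara & Grande",
    "Sterile sorediate crust",
]:
    print(name, "->", stepwise_match(name, checklist))
```

Because each result carries the step that produced it, a report back to a publisher could group names by step (authorship differences versus likely typos versus unmatched) instead of returning one undifferentiated list of fuzzy hits.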
So for something like Rhynchelytrium nepeng (col:Melinis repens), which is just a misspelling, it is better to try to fix it at source rather than try to make the fuzzy matching work...
But for something like Iris xiphium (L.) Dryand. ex Ait. (col:Iris xiphium), which is not a misspelling but still does not match at the species rank, should we also get the publishers to change it at source, or is there a way to add it to Synonyms and Combinations? Why does "Iris xiphium ... extra stuff ...." not already match "Iris xiphium L." in COL? Is it an edge case? https://www.catalogueoflife.org/data/taxon/6MXYP
"Iris xiphium (L.) Dryand. ex Ait." and "Iris xiphium L." have entirely different authorships, so they should not match.
Unless we want the occurrence matching to do so and ignore authorships before falling back to a higher-rank match? That is something we could actually add and return as a FUZZY_MATCH.
Adding more synonyms to COL and broadening its name coverage is exactly what the next step of building an extended COL is about. That should happen right after summer.
"Iris xiphium (L.) Dryand. ex Ait. and Iris xiphium L. have entirely different authorships" @mdoering, the authorships are not "entirely different". The strings differ, but the abbreviation "L." for Linnaeus is included in both strings. The scientific names involved are the same and they must match. Names should also be matched by their name strings first, e.g., Iris xiphium should be matched with a 100% score to Iris xiphium. Having homonyms (names with the same spelling but different authors) is the exception rather than the rule, and that is why for the vast majority of species-rank names just the name strings are needed for an accurate match. Correcting matches for homonyms is to be handled by checking the authority match, e.g., "author, year" and reducing the 100% of the scientific name match alone to a lower one when based on all components. What this case also tells you is that the GBIF matching process needs a powerful parser for the authority string. A parser that is able to handle several typographical characters (parentheses, periods, commas, ampersand, etc.) and that is still able to recognize and match sub-string elements such as "L.". The parser can be powered by a dictionary, which is greatly facilitated in Botany because the abbreviations for many authors' names are TDWG standards.
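A rough Python sketch of the scoring described above: match on the name string first, then lower the score only when both authorships are present and share no author element. This is an illustration, not GBIF's parser or matcher. The function names, the connector list and the 0.5 penalty are assumptions, and a real parser would rely on a dictionary of standard author abbreviations rather than the naive tokenisation used here.

```python
import re

# Connector words that carry no author identity (illustrative, not exhaustive).
CONNECTORS = {"ex", "et", "in", "and"}


def split_name(full_name):
    """Split 'Genus epithet Authorship' into (binomial, authorship)."""
    parts = full_name.split(maxsplit=2)
    binomial = " ".join(parts[:2])
    authorship = parts[2] if len(parts) > 2 else ""
    return binomial, authorship


def author_tokens(authorship):
    """Author elements, lower-cased; punctuation, years and connectors dropped."""
    tokens = re.findall(r"[^\W\d_]+", authorship)  # runs of letters only
    return {t.lower() for t in tokens if t.lower() not in CONNECTORS}


def match_score(query, candidate, homonym_penalty=0.5):
    """Score 1.0 on an identical binomial; reduce it if authorships conflict."""
    q_bin, q_auth = split_name(query)
    c_bin, c_auth = split_name(candidate)
    if q_bin.lower() != c_bin.lower():
        return 0.0
    q_tok, c_tok = author_tokens(q_auth), author_tokens(c_auth)
    if not q_tok or not c_tok:
        return 1.0  # no authorship on one side: the name string decides
    if q_tok & c_tok:
        return 1.0  # at least one shared author element, e.g. "L."
    return 1.0 - homonym_penalty  # possible homonym: same name, other authors


print(match_score("Iris xiphium (L.) Dryand. ex Ait.", "Iris xiphium L."))
# "Aus bus" is a made-up placeholder name for a homonym pair.
print(match_score("Aus bus Smith", "Aus bus Jones"))
```

Treating any shared token as agreement is deliberately crude: a single-letter initial such as the "L." in "Fairmaire, L., 1897" would be confused with the abbreviation for Linnaeus, which is exactly where a dictionary of author abbreviations would be needed.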
I have written a script for finding GBIF's most wanted verbatim scientific names. That might be of interest for this project.
https://github.com/jhnwllr/gbif_most_wanted_names
What is a most wanted name?
It is a species-rank, reasonable-looking character string (or name) that does not match the GBIF backbone except at a higher rank.
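The definition can be read as a simple filter. The Python sketch below is not the logic of the linked script; the record field names (`verbatim`, `match_type`), the regular expression used for "reasonable looking" and the blacklist handling are assumptions for illustration. `HIGHERRANK` is used here as the label for a match that only succeeded above species rank.

```python
import re

# Crude test for "looks like a binomial": capitalised genus, lower-case epithet.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z]{2,}\b")


def looks_like_species_name(verbatim, blacklist=frozenset()):
    """True if the string starts like a binomial and is not blacklisted."""
    return verbatim not in blacklist and bool(BINOMIAL.match(verbatim))


def is_most_wanted(record, blacklist=frozenset()):
    """A name is 'most wanted' if it only matched at a higher rank."""
    return (
        record["match_type"] == "HIGHERRANK"
        and looks_like_species_name(record["verbatim"], blacklist)
    )


records = [
    {"verbatim": "Rhynchelytrium nepeng", "match_type": "HIGHERRANK"},
    {"verbatim": "Sterile sorediate crust", "match_type": "HIGHERRANK"},
    {"verbatim": "Iris xiphium L.", "match_type": "EXACT"},
]
blacklist = frozenset({"Sterile sorediate crust"})
wanted = [r["verbatim"] for r in records if is_most_wanted(r, blacklist)]
print(wanted)
```

Without the blacklist, a descriptive phrase such as "Sterile sorediate crust" passes the pattern because it begins with a capitalised word followed by a lower-case one, so a shape check alone cannot filter it out.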
The top 10 most wanted names according to the most recent run.
Find the full dataset here.
Names like "Sterile sorediate crust" might need to be added to the blacklist.