gbif / checklistbank

GBIF Checklist Bank
Apache License 2.0

GBIF most wanted names #176

Open jhnwllr opened 3 years ago

jhnwllr commented 3 years ago

I have written a script for finding GBIF's most wanted verbatim scientific names. It might be of interest for this project.

https://github.com/jhnwllr/gbif_most_wanted_names

What is a most wanted name?

It is a species-rank, reasonable-looking character string (or name) that does not match the GBIF backbone except at a higher rank.
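A minimal sketch of how that definition could be applied to one row of the results, assuming the column names from the table in this issue. The rank ordering, the `BINOMIAL` plausibility pattern, and the function name `is_most_wanted` are hypothetical illustrations, not the logic of the linked script.

```python
import re

# Rank order from most to least inclusive; a smaller index means a higher rank.
RANK_ORDER = ["KINGDOM", "PHYLUM", "CLASS", "ORDER", "FAMILY", "GENUS", "SPECIES"]

# Hypothetical plausibility check: a capitalised genus followed by a lowercase epithet.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z-]+\b")


def is_most_wanted(row):
    """Return True if a species-rank verbatim name only matched at a higher rank."""
    looks_reasonable = BINOMIAL.match(row["v_scientificname"]) is not None
    published_as_species = row["publisher_rank"] == "SPECIES"
    matched_higher = RANK_ORDER.index(row["taxonrank"]) < RANK_ORDER.index("SPECIES")
    return looks_reasonable and published_as_species and matched_higher


row = {
    "v_scientificname": "Callophycus laxus",
    "publisher_rank": "SPECIES",
    "taxonrank": "FAMILY",
}
print(is_most_wanted(row))
```

A purely shape-based check like this also accepts strings such as "Sterile sorediate crust", which is why a blacklist of non-names would still be needed on top of it.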


The top 10 most wanted names according to the most recent run.

| kingdom | v_scientificname | publisher_rank | taxonrank | occ_count | n_dataset | n_publisher |
| --- | --- | --- | --- | --- | --- | --- |
| Plantae | Rhynchelytrium nepeng | SPECIES | FAMILY | 4562 | 2 | 1 |
| Plantae | Aliaria petiolata (M.Bieb.) Cavara & Grande | SPECIES | FAMILY | 1579 | 3 | 1 |
| Plantae | Iris xiphium (L.) Dryand. ex Ait. | SPECIES | FAMILY | 1202 | 2 | 1 |
| Fungi | Candolleomyces candolleanus | SPECIES | FAMILY | 1150 | 1 | 1 |
| Animalia | Expapillata firmatoi (Barretto, Martins & Pellegrino, 1956) | SPECIES | FAMILY | 1024 | 1 | 1 |
| Plantae | Callophycus laxus | SPECIES | FAMILY | 828 | 5 | 5 |
| Animalia | Tropicagama temporalis | SPECIES | FAMILY | 686 | 3 | 3 |
| Fungi | Sterile sorediate crust | SPECIES | FAMILY | 604 | 1 | 1 |
| Animalia | Ganglionus catenatus | SPECIES | FAMILY | 482 | 2 | 2 |
| Animalia | Deanemyia samueli (Deane, 1955) | SPECIES | FAMILY | 460 | 1 | 1 |

Find the full dataset here.

Names like "Sterile sorediate crust" might need to be added to the blacklist.

mdoering commented 3 years ago

@yroskov @gdower @NicBailly @olafbanki this should be interesting for the COL gap analysis and prioritisation.

yroskov commented 3 years ago

"Rhynchelytrium nepeng" looks like a badly misspelled form of the species name Rhynchelytrum repens (Willd.) C.E.Hubb., from the family Poaceae. It is a synonym of Melinis repens (Willd.) Zizka.

It is present in the CoL: https://www.catalogueoflife.org/data/taxon/3ZFLC

Who is the GBIF data provider (n_publisher) for the 4562 occurrences with this name?

yroskov commented 3 years ago

"Aliaria" petiolata (M.Bieb.) Cavara & Grande is again a misspelled name: Alliaria petiolata (M.Bieb.) Cavara & Grande, from the family Brassicaceae.

It is present in the CoL: https://www.catalogueoflife.org/data/taxon/BTPF

yroskov commented 3 years ago

Iris xiphium "(L.) Dryand. ex Ait." appears in the CoL as Iris xiphium L.: https://www.catalogueoflife.org/data/taxon/6MXYP

yroskov commented 3 years ago

"Candolleomyces candolleanus" looks like a synonym for Psathyrella candolleana (Fr.) Maire: https://www.catalogueoflife.org/data/taxon/4NDVN.

This combination is not yet present in the CoL because it was published in 2020.

yroskov commented 3 years ago

Expapillata firmatoi (Barretto, Martins & Pellegrino, 1956). The combination is not present in CoL, but the species is present in it as Lutzomyia firmatoi (Barretto, Martins & Pellegrino, 1956): https://www.catalogueoflife.org/data/taxon/3WGXN.

The majority of taxonomic databases in zoology do not collect all subsequent combinations in synonymy. Unfortunately, the combination "Expapillata firmatoi" will not appear in the CoL even after the retirement of CIPA and a new update of Systema Dipterorum.

yroskov commented 3 years ago

Callophycus laxus is an algal species from Rhodophyta. Unfortunately, AlgaeBase left the CoL in 2013. Before that, this species was present in the CoL: http://www.catalogueoflife.org/annual-checklist/2012/details/species/id/15246

yroskov commented 3 years ago

Tropicagama temporalis (Agamidae family). The combination is not present in CoL, but the species is present in it as Gowidon temporalis (Günther, 1867): https://www.catalogueoflife.org/data/taxon/3H4YT

The combination "Tropicagama temporalis" may appear in the CoL with a new ReptileDB update.

yroskov commented 3 years ago

Sterile sorediate crust looks to me like a Latin descriptive phrase rather than a scientific name.

yroskov commented 3 years ago

Ganglionus catenatus

It seems that the genus Ganglionus (2001), from Curculionidae, is missing from WTaxa.

"Ganglionus catenatus" might also be a subsequent combination of Cionus catenatus Fairmaire, L., 1897, a species present in the CoL: https://www.catalogueoflife.org/data/taxon/VFG8

yroskov commented 3 years ago

Deanemyia samueli (Deane, 1955). The combination is not present in CoL, but the species is present in it as Lutzomyia samueli (Deane, 1955): https://www.catalogueoflife.org/data/taxon/3WH2L

yroskov commented 3 years ago

| kingdom | v_scientificname | occ_count | n_dataset | n_publisher | the species is present in CoL as: | link |
| --- | --- | --- | --- | --- | --- | --- |
| Plantae | Rhynchelytrium nepeng | 4562 | 2 | 1 | Melinis repens | https://www.catalogueoflife.org/data/taxon/3ZFLC |
| Plantae | Aliaria petiolata (M.Bieb.) Cavara & Grande | 1579 | 3 | 1 | Alliaria petiolata | https://www.catalogueoflife.org/data/taxon/BTPF |
| Plantae | Iris xiphium (L.) Dryand. ex Ait. | 1202 | 2 | 1 | Iris xiphium | https://www.catalogueoflife.org/data/taxon/6MXYP |
| Fungi | Candolleomyces candolleanus | 1150 | 1 | 1 | Psathyrella candolleana | https://www.catalogueoflife.org/data/taxon/4NDVN |
| Animalia | Expapillata firmatoi (Barretto, Martins & Pellegrino, 1956) | 1024 | 1 | 1 | Lutzomyia firmatoi | https://www.catalogueoflife.org/data/taxon/3WGXN |
| Plantae | Callophycus laxus | 828 | 5 | 5 | until 2013 | http://www.catalogueoflife.org/annual-checklist/2012/details/species/id/15246 |
| Animalia | Tropicagama temporalis | 686 | 3 | 3 | Gowidon temporalis | https://www.catalogueoflife.org/data/taxon/3H4YT |
| Fungi | Sterile sorediate crust | 604 | 1 | 1 | no | non-sci name (?) |
| Animalia | Ganglionus catenatus | 482 | 2 | 2 | Cionus catenatus (?) | https://www.catalogueoflife.org/data/taxon/VFG8 |
| Animalia | Deanemyia samueli (Deane, 1955) | 460 | 1 | 1 | Lutzomyia samueli | https://www.catalogueoflife.org/data/taxon/3WH2L |

yroskov commented 3 years ago

CONCLUSION

of 10 mismatched names:

mdoering commented 3 years ago

Thanks Yuri! Good to know most species do exist in COL. Still, I do wonder what we can do to improve the recall, as Dave would say. Misspellings in occurrence data are something GBIF needs to worry about: either improve the fuzzy matching or report the wrong data to the data publishers.

But with missing synonyms, users coming in with these names (and they appear on thousands of records) have a hard time using COL. This is where the extended COL is really needed.

@ahahn-gbif can we contact https://www.gbif.org/dataset/a720d91c-36d6-45a2-a163-e7145ebb30dd as they seem to have 4561 records with a typo for a grass and are pretty much the only ones that have this wrong name?

https://www.gbif.org/occurrence/search?q=Rhynchelytrium%20nepeng&dataset_key=a720d91c-36d6-45a2-a163-e7145ebb30dd

NicBailly commented 3 years ago

@yroskov @mdoering I have not followed the development of the GBIF fuzzy matching for a long time, so I do not know exactly how it works. I once worked for Dave Remsen to understand why so many fish names were not matching, since FishBase collates numerous new combinations and misspellings; I can resend you the report. I also started to develop what I called the GSAy matching, which I half published when we created BiOnym for D4Science. I am not sure whether the COL IT group still has meetings, but I could present this method. Basically, instead of applying the fuzzy match to all names at once, the matching is done step by step, so that cases of the same nature are solved together, a number of false positives are avoided, and the answer to the provider can probably be structured in a less scary way.
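To make the step-by-step idea concrete, here is a rough sketch of staged matching. It is not the GSAy or BiOnym method, whose details are not given in this thread; the stage names, the tiny `REFERENCE` list, the `canonical` helper, and the 0.8 cutoff are all assumptions for illustration. The fuzzy step uses `difflib.get_close_matches` from the Python standard library.

```python
import difflib
from collections import defaultdict

# Tiny illustrative reference list; a real run would use the full checklist.
REFERENCE = ["Alliaria petiolata", "Rhynchelytrum repens", "Iris xiphium"]


def canonical(name):
    """Keep the first two words (genus and epithet), normalised in case."""
    parts = name.split()[:2]
    if len(parts) < 2:
        return name.strip()
    return parts[0].capitalize() + " " + parts[1].lower()


def staged_match(name, reference=REFERENCE, cutoff=0.8):
    """Try cheap, strict steps first and fall back to fuzzy matching last."""
    if name in reference:
        return ("EXACT", name)
    canon = canonical(name)
    if canon in reference:
        return ("CANONICAL", canon)
    close = difflib.get_close_matches(canon, reference, n=1, cutoff=cutoff)
    if close:
        return ("FUZZY", close[0])
    return ("NONE", None)


def summarise(names):
    """Group verbatim names by the stage that resolved them."""
    report = defaultdict(list)
    for name in names:
        stage, match = staged_match(name)
        report[stage].append((name, match))
    return dict(report)


print(summarise([
    "Iris xiphium (L.) Dryand. ex Ait.",
    "Aliaria petiolata (M.Bieb.) Cavara & Grande",
    "Sterile sorediate crust",
]))
```

Grouping the results by stage is one way the answer to a provider could be structured: names that differ only in authorship, names that look like typos, and names that did not resolve at all end up in separate lists.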

jhnwllr commented 3 years ago

So for something like Rhynchelytrium nepeng (col:Melinis repens), which is just a misspelling, it is better to try to fix them at source rather than try to make the fuzzy matching work...

But for something like Iris xiphium (L.) Dryand. ex Ait. (col:Iris xiphium), which is not a misspelling but still does not match at the species rank, should we also get the publishers to change it at source, or is there a way to add it to Synonyms and Combinations? Why does "Iris xiphium ... extra stuff ...." not already match "Iris xiphium L." in COL? Is it an edge case? https://www.catalogueoflife.org/data/taxon/6MXYP

mdoering commented 3 years ago

Iris xiphium (L.) Dryand. ex Ait. and Iris xiphium L. have entirely different authorships, so they should not match. Unless we want the occurrence matching to do so and ignore authorships before falling back to a higher-rank match? That is something we could actually add and return as a FUZZY_MATCH.
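A sketch of what such a fallback order could look like, assuming a name is simply "genus epithet" followed by authorship. Only the `FUZZY_MATCH` label comes from the comment above; the other labels, the function names, and the `genus_to_family` lookup are hypothetical, and this is not the existing matching code.

```python
def split_name(name):
    """Split a name string into (canonical binomial, authorship remainder)."""
    parts = name.split()
    return " ".join(parts[:2]), " ".join(parts[2:])


def match_with_authorship_fallback(verbatim, backbone_names, genus_to_family):
    """Return (match_type, matched_name) for a verbatim species-level name."""
    # Step 1: the full string, authorship included, must be identical.
    if verbatim in backbone_names:
        return ("EXACT", verbatim)
    # Step 2: ignore authorship and compare the binomial only.
    canon, _authorship = split_name(verbatim)
    for candidate in backbone_names:
        if split_name(candidate)[0] == canon:
            return ("FUZZY_MATCH", candidate)
    # Step 3: only now fall back to a higher-rank match via the genus.
    genus = canon.split()[0] if canon else ""
    if genus in genus_to_family:
        return ("HIGHERRANK", genus_to_family[genus])
    return ("NONE", None)


backbone = ["Iris xiphium L."]
families = {"Iris": "Iridaceae"}
print(match_with_authorship_fallback("Iris xiphium (L.) Dryand. ex Ait.", backbone, families))
```

With this ordering the verbatim string resolves to the species instead of dropping straight to the family, while the match type still signals that the authorship was not compared.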

Adding more synonyms to COL and broadening its name coverage is exactly what the next step, building an extended COL, is about. That should happen right after summer.

Archilegt commented 2 years ago

"Iris xiphium (L.) Dryand. ex Ait. and Iris xiphium L. have entirely different authorships" @mdoering, the authorships are not "entirely different". The strings differ, but the abbreviation "L." for Linnaeus is included in both. The scientific names involved are the same and they must match.

Names should be matched by their name strings first, e.g., Iris xiphium should be matched with a 100% score to Iris xiphium. Homonyms (names with the same spelling but different authors) are the exception rather than the rule, which is why, for the vast majority of species-rank names, the name string alone is enough for an accurate match. Homonyms can be handled by checking the authority (e.g., "author, year") and lowering the 100% score of the name-string match when the authorities disagree.

This case also shows that the GBIF matching process needs a powerful parser for the authority string: one that can handle the various typographical characters (parentheses, periods, commas, ampersands, etc.) and still recognize and match sub-string elements such as "L.". The parser can be powered by a dictionary, which is greatly facilitated in botany because the abbreviations for many authors' names are TDWG standards.