gbif / backbone-feedback


Basionym relations incorrectly derived #274

Open Mesibov opened 2 years ago

Mesibov commented 2 years ago

See https://github.com/gbif/backbone-feedback/issues/275 where a fungus specialist found 98 pairs of accepted fungus names attributed in each case to the same basionym.

This problem was attributed to a name-matching algorithm in a 2017 portal post, https://github.com/gbif/portal-feedback/issues/450

I repeated the fungus exercise and found

Fungi, 196 records
Plants, 2223 records
Animals, 5647 records

The plant and animal totals are odd numbers because there are quite a few "triplets" (same originalNameUsageID, 3 different accepted scientificNames), "quadruplets", "quintuplets", "sextuplets" and even 1 "septuplet":

originalNameUsageID|taxonID|scientificName
5032326|1306697|Listrognathus annulipes (Cameron, 1904)
5032326|1308757|Ophionellus annulipes (Cameron, 1911)
5032326|1277594|Goryphus annulipes (Cameron, 1909)
5032326|5025783|Paraphylax annulipes (Cameron, 1905)
5032326|1278776|Euchalinus annulipes (Cameron, 1903)
5032326|1282545|Gotra annulipes (Cameron, 1906)
5032326|1292812|Coelichneumon annulipes (Cameron, 1905)

taxonID 5032326 is Allotheca annulipes Cameron, 1906 (Hymenoptera, Ichneumonidae, "accepted") https://www.gbif.org/species/5032326 https://www.catalogueoflife.org/data/taxon/BZP5

Here are the others:

Listrognathus annulipes (Cameron, 1904) https://www.gbif.org/species/1306697 ("Basionym relation derived") https://www.catalogueoflife.org/data/taxon/3VGY4

Ophionellus annulipes (Cameron, 1911) https://www.gbif.org/species/1308757 ("Basionym relation derived") https://www.catalogueoflife.org/data/taxon/49YPB

Goryphus annulipes (Cameron, 1909) https://www.gbif.org/species/1277594 ("Basionym relation derived") https://www.catalogueoflife.org/data/taxon/3H3JH

Paraphylax annulipes (Cameron, 1905) https://www.gbif.org/species/5025783 ("Basionym relation derived") https://www.catalogueoflife.org/data/taxon/4D8JD

Euchalinus annulipes (Cameron, 1903) https://www.gbif.org/species/1278776 ("Basionym relation derived") https://www.catalogueoflife.org/data/taxon/6GTS2

Gotra annulipes (Cameron, 1906) https://www.gbif.org/species/1282545 ("Basionym relation derived") https://www.catalogueoflife.org/data/taxon/6L3T4

Coelichneumon annulipes (Cameron, 1905) https://www.gbif.org/species/1292812 ("Basionym relation derived") https://www.catalogueoflife.org/data/taxon/5ZH52

These all appear to be failures of the name-matching algorithm ("Basionym relation derived"). The plant and animal problems were found with the following commands (GNU awk 4.0 or later, since they use arrays of arrays):

awk -F"\t" '$18=="Plantae" && $15=="accepted" && $5 != "" {a[$5]++;b[$5][$1"|"$6]++} END {for (i in b) {for (j in b[i]) {if (a[i]>b[i][j]) print i "|" j}}}' Taxon.tsv

awk -F"\t" '$18=="Animalia" && $15=="accepted" && $5 != "" {a[$5]++;b[$5][$1"|"$6]++} END {for (i in b) {for (j in b[i]) {if (a[i]>b[i][j]) print i "|" j}}}' Taxon.tsv
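For readers without GNU awk, the same grouping can be sketched in Python. This is a minimal sketch, not the code used in this thread: it assumes the column positions implied by the awk commands above (field 1 taxonID, field 5 originalNameUsageID, field 6 scientificName, field 15 status, field 18 kingdom), and the helper names `read_rows`, `shared_basionyms` and `make_row` are hypothetical. The third sample row is invented to show a non-matching case.

```python
import csv
from collections import defaultdict


def read_rows(path):
    """Yield rows from a tab-separated file such as Taxon.tsv (no quoting)."""
    with open(path, newline="", encoding="utf-8") as fh:
        yield from csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)


def shared_basionyms(rows, kingdom):
    """Map each originalNameUsageID to the accepted names that point at it,
    keeping only IDs claimed by more than one distinct accepted name."""
    groups = defaultdict(set)
    for row in rows:
        if len(row) < 18:
            continue  # skip short or malformed lines
        taxon_id, original_id, name = row[0], row[4], row[5]
        status, row_kingdom = row[14], row[17]
        if row_kingdom == kingdom and status == "accepted" and original_id:
            groups[original_id].add((taxon_id, name))
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


def make_row(taxon_id, original_id, name, status, kingdom):
    """Build an 18-field row with only the fields used above filled in."""
    row = [""] * 18
    row[0], row[4], row[5], row[14], row[17] = (
        taxon_id, original_id, name, status, kingdom)
    return row


sample = [
    make_row("1306697", "5032326", "Listrognathus annulipes (Cameron, 1904)",
             "accepted", "Animalia"),
    make_row("1282545", "5032326", "Gotra annulipes (Cameron, 1906)",
             "accepted", "Animalia"),
    # Invented row: a basionym ID used by a single accepted name.
    make_row("999", "888", "Hypothetical name", "accepted", "Animalia"),
]

result = shared_basionyms(sample, "Animalia")
for original_id, members in result.items():
    for taxon_id, name in members:
        print(f"{original_id}|{taxon_id}|{name}")
```

On a real download, `shared_basionyms(read_rows("Taxon.tsv"), "Plantae")` would play the role of the first awk command, provided the file has the column order assumed here.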

mdoering commented 2 years ago

@Mesibov please see my replies in the original issue gbif/backbone-feedback#275. It is primarily a source problem, and the behaviour is the desired one. But I fully agree there is room for improvement...

Mesibov commented 2 years ago

@mdoering, note that the basionym cannot have been published after the new name was published. In the example I cite the basionym was published in 1906, so the 4 new names from 1903, 1904, 1905 and 1905 should be excluded from the derived relationship, and the 1906 one needs checking. This could be used as a check on the source; if the source fails this test, there's a problem there.

mdoering commented 2 years ago

No, in the example you have given we don't know the year of publication of the specific combination - there is no requirement to cite it in zoology. We only know the year of the basionyms. If those authorships are correct then all those different years refer to different basionyms, not the Allotheca one from 1906. It seems the basionym detection allows for some fuzziness in the year, but I don't remember off the top of my head.
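A strict version of the year comparison discussed here might look like the sketch below. It is not the backbone's basionym detection, whose actual rules are not shown in this thread. It assumes the year in a zoological authorship is that of the original description, so a recombination and its basionym should carry the same year. The function names `authorship_year` and `check_derived_basionym` are hypothetical.

```python
import re

# Four-digit years from 1600 to 2099, an assumed plausible range.
YEAR_RE = re.compile(r"\b(1[6-9]\d{2}|20\d{2})\b")


def authorship_year(scientific_name):
    """Return the last four-digit year found in a name string, or None."""
    years = YEAR_RE.findall(scientific_name)
    return int(years[-1]) if years else None


def check_derived_basionym(combination, basionym):
    """Classify a derived relation by comparing authorship years exactly."""
    comb_year = authorship_year(combination)
    bas_year = authorship_year(basionym)
    if comb_year is None or bas_year is None:
        return "unknown"
    return "consistent" if comb_year == bas_year else "mismatch"


basionym = "Allotheca annulipes Cameron, 1906"
combinations = [
    "Listrognathus annulipes (Cameron, 1904)",
    "Ophionellus annulipes (Cameron, 1911)",
    "Goryphus annulipes (Cameron, 1909)",
    "Paraphylax annulipes (Cameron, 1905)",
    "Euchalinus annulipes (Cameron, 1903)",
    "Gotra annulipes (Cameron, 1906)",
    "Coelichneumon annulipes (Cameron, 1905)",
]

verdicts = {c: check_derived_basionym(c, basionym) for c in combinations}
for name, verdict in verdicts.items():
    print(f"{verdict}: {name}")
```

With exact matching, only the 1906 Gotra name passes the year test against Allotheca annulipes. That name would still need checking by other means.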

One thing to improve for sure is to not create a derived basionym relation in case the homotypic grouping is skipped for some reason. The relations are likely to be wrong in that case.

Mesibov commented 2 years ago

@mdoering, as a zoological taxonomist I'm quite well aware that new combinations don't have dates in their formatting. But I'm still confused. Taking Listrognathus annulipes as an example, the backbone attaches it by taxonID 1306697 in Reference.tsv to

Cameron, P. (1904) Descriptions of new species of aculeate and parasitic Hymenoptera from northern India.: Annals and Magazine of Natural History. 13:211-233.

while the "basionym" derived for this name has taxonID 5032326 and Reference.tsv assigns that to

Cameron, P. (1906) Descriptions of new species of parasitic Hymenoptera chiefly in the collection of the South African Museum, Cape Town.: Annals of the South African Museum. 5:17-186.

In both cases GBIF's reference source is the Catalogue of Life. So even if you only know those publication dates as the dates of basionyms, how did Listrognathus annulipes get referred to Allotheca annulipes as its original name?

This is what both Bolshakov and I are asking, and what I suggested above with dates could be an additional check for you when checking whether basionym "derivation" is working correctly or not.
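The reference-based check suggested here could be sketched as follows. This is an illustration only: the `references` dictionary stands in for a taxonID-to-citation lookup built from Reference.tsv, whose column layout is not given in this thread, and the function names `citation_year` and `same_reference_year` are hypothetical.

```python
import re


def citation_year(citation):
    """Return the first parenthesised year in a citation, e.g. '(1904)'."""
    match = re.search(r"\((\d{4})[a-z]?\)", citation)
    return int(match.group(1)) if match else None


def same_reference_year(name_id, basionym_id, references):
    """True or False if both citations have a year, None if either lacks one."""
    name_year = citation_year(references.get(name_id, ""))
    basionym_year = citation_year(references.get(basionym_id, ""))
    if name_year is None or basionym_year is None:
        return None
    return name_year == basionym_year


# Citations quoted in this thread, keyed by taxonID.
references = {
    "1306697": "Cameron, P. (1904) Descriptions of new species of aculeate "
               "and parasitic Hymenoptera from northern India.: Annals and "
               "Magazine of Natural History. 13:211-233.",
    "5032326": "Cameron, P. (1906) Descriptions of new species of parasitic "
               "Hymenoptera chiefly in the collection of the South African "
               "Museum, Cape Town.: Annals of the South African Museum. "
               "5:17-186.",
}

flag = same_reference_year("1306697", "5032326", references)
print("reference years agree" if flag else "reference years differ or unknown")
```

A `False` result only says that the two attached references carry different years. It does not show which side (source data or derived relation) is wrong.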

mdoering commented 2 years ago

Yes, as I said above: I think the basionym detection allows for some uncertainty in years and I guess it would be better to remove that.

On a side note, the references are wrongly attached. They are the publication of the original name, not the recombination, and should only be present with the basionym. If the recombination was never published there should be no reference. But this is probably a very common error for zoological names in many sources we use. @yroskov @gdower for your attention. This is true for both DwC and ColDP, though I find the DwC definition worth an additional comment.

mdoering commented 2 years ago

I've created https://github.com/tdwg/dwc/issues/405