CatalogueOfLife / backend

Complete backend of COL ChecklistBank
Apache License 2.0

Update TaxGroup parser #1284

Open mdoering opened 6 months ago

mdoering commented 6 months ago

An important part of name usage matching, apart from plain name matching, is comparing the classifications of matched candidates to disambiguate homonyms. Because classifications can differ widely in some parts, or exist only patchily, the algorithm does not compare them directly. Instead it tries to match each higher taxon to a limited, hand-selected set of hierarchical taxonomic groups in order to keep the major groups apart, e.g. plants from animals. For each group we maintain a text file listing higher names, down to families, that unambiguously indicate that group. For example, Asteraceae clearly points to Angiosperms.
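To illustrate the idea, here is a minimal sketch of how group lookups can keep homonym candidates apart. It is not the backend's actual parser: the group hierarchy, the name lists, and the function names (`ancestors`, `groups_of`, `is_compatible`) are all assumptions made up for this example.

```python
# Hypothetical group hierarchy: child -> parent (None marks a root group).
GROUP_PARENT = {
    "Plants": None,
    "Angiosperms": "Plants",
    "Animals": None,
    "Arthropods": "Animals",
}

# Hypothetical excerpt of the per-group name files (higher name -> group).
NAME_TO_GROUP = {
    "plantae": "Plants",
    "asteraceae": "Angiosperms",
    "animalia": "Animals",
    "insecta": "Arthropods",
}


def ancestors(group):
    """Return the group itself followed by all of its parent groups."""
    chain = []
    while group is not None:
        chain.append(group)
        group = GROUP_PARENT[group]
    return chain


def groups_of(classification):
    """Collect the groups indicated by any higher name in a classification."""
    found = set()
    for name in classification:
        group = NAME_TO_GROUP.get(name.strip().lower())
        if group is not None:
            found.add(group)
    return found


def is_compatible(a, b):
    """Classifications conflict only if they indicate groups on different branches.

    A classification without any recognised name gives no evidence of a
    conflict, so it is treated as compatible with everything.
    """
    for x in groups_of(a):
        for y in groups_of(b):
            if x not in ancestors(y) and y not in ancestors(x):
                return False
    return True
```

With these sample lists, `is_compatible(["Asteraceae"], ["Animalia", "Insecta"])` is false, while a classification made up only of unlisted names is compatible with anything. That is why names missing from the files weaken the disambiguation.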

ChecklistBank has a tool to analyse all higher names and report those that are currently not listed in any of the files. Go through the names at least down to class, preferably down to order, and add them to the respective parser files.
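The kind of check such a report performs can be sketched as follows. This is only an illustration and not the ChecklistBank tool itself: the function name `unmapped_names` and the in-memory shape of the parser files (a dict from group name to a list of higher names) are assumptions.

```python
def unmapped_names(higher_names, group_files):
    """Return the higher names not listed in any group file.

    higher_names: iterable of scientific names (hypothetical input).
    group_files:  dict mapping a group name to the names listed in its file.
    Comparison is case-insensitive; each unmapped name is reported once,
    in the order it is first seen.
    """
    known = {n.strip().lower() for names in group_files.values() for n in names}
    seen = set()
    result = []
    for name in higher_names:
        key = name.strip().lower()
        if key and key not in known and key not in seen:
            seen.add(key)
            result.append(name.strip())
    return result
```

Each name this returns is a candidate for one of the parser files, provided it unambiguously indicates a single group.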

File with higher CLB names not mapped to any tax group: not_mapped.tsv.gz

mdoering commented 6 months ago

@DianRHR @camiplata maybe you can go through some of these names and place them in the correct parser files?

mdoering commented 2 months ago

We should also consider processing all verbatim GBIF classifications to see which ones yield no group at all.