gbif / name-parser

The core GBIF scientific name parser library
Apache License 2.0

Interpret GTDB OTU names #74

Closed by mdoering 3 years ago

mdoering commented 3 years ago

The Genome Taxonomy Database is supposed to be the primary source for Bacteria & Archaea in COL and GBIF: https://github.com/CatalogueOfLife/data/issues/202

About two thirds of the names are OTU names, and unlike BOLD or SH numbers they do not follow a single, simple syntax. The name parser should be improved to recognize, ideally, all of these patterns as NameType=OTU.

Extract a list of OTU patterns first. Examples are:

Note that the species names stick to a binomial pattern with GENUS SPECIFIC_EPITHET and that the specific epithet always starts with sp.

Another common pattern is the _[A-Z] suffix extending Linnean names. Genus OTU names seem to follow various patterns.
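A minimal sketch of how the two patterns described above could be detected, written in Python for brevity rather than in the parser's own language. It is not the name-parser implementation. The function name `name_type`, the requirement that a placeholder epithet is `sp` followed by digits, and the allowance for suffixes longer than one letter are all assumptions made for illustration, as are the sample names in the usage note.

```python
import re

# Assumption: a placeholder epithet is "sp" followed only by digits.
# Requiring digits keeps real epithets such as "splendens" out.
PLACEHOLDER_EPITHET = re.compile(r"sp\d+")

# The _[A-Z] suffix from the issue text; "+" (more than one letter)
# is an assumption, not something the issue states.
ALPHA_SUFFIX = re.compile(r"_[A-Z]+$")


def name_type(name: str) -> str:
    """Return "OTU" for names matching either pattern, else "SCIENTIFIC"."""
    parts = name.split()
    # Binomial whose epithet is a placeholder such as "sp000318095".
    if len(parts) == 2 and PLACEHOLDER_EPITHET.fullmatch(parts[1]):
        return "OTU"
    # Any name part carrying an alphabetic suffix, e.g. "Prevotella_A".
    if any(ALPHA_SUFFIX.search(part) for part in parts):
        return "OTU"
    return "SCIENTIFIC"
```

With these assumptions, `name_type("Prevotella sp000318095")` and `name_type("Prevotella_A")` both return `"OTU"`, while `name_type("Salvia splendens")` returns `"SCIENTIFIC"`. Genus-level OTU names that follow other patterns are not covered by this sketch.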

mdoering commented 3 years ago

Why are some genus names formed from a strain identifier?

A strain identifier is used as a placeholder for the genus name when there is no existing genus name and no binomially named representative genome. For example, the genome GCF_000318095.2 has the NCBI organism name Prevotella sp. oral taxon 473 str. F0040 and is assigned to the genus Alloprevotella in NCBI. However, in GTDB this genome appears to belong to neither Prevotella, Alloprevotella, nor the closely related genus Prevotellamassilia. Consequently, we assign it to the placeholder genus g__F0040. If the organism had been assigned a binomial species name such as Prevotella oralitaxus str. F0040, and it is not part of true Prevotella in GTDB, we would assign it to the placeholder genus g__Prevotella_A to indicate it is not a true Prevotella species, but that there are representative genomes that have been assigned to a species.
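The rule quoted above can be illustrated with a small Python sketch. It only mirrors the two cases in the example and is not GTDB's actual curation logic: the function name `placeholder_genus`, the `str.` marker used to locate the strain identifier, and the `suffix` parameter (GTDB chooses the letter, here it is simply passed in) are assumptions. The returned genus name carries no rank prefix.

```python
import re

# Assumption: the strain identifier is the token following "str."
STRAIN = re.compile(r"\bstr\.\s+(\S+)")


def placeholder_genus(ncbi_name: str, suffix: str = "A") -> str:
    """Derive a placeholder genus for a genome that does not fall
    into an existing genus, following the two cases in the FAQ."""
    tokens = ncbi_name.split()
    if len(tokens) < 2:
        raise ValueError(f"expected at least genus and epithet: {ncbi_name!r}")
    genus, epithet = tokens[0], tokens[1]

    # Case 1: a binomial species name exists, so the genus name is
    # reused with an alphabetic suffix to mark it as a placeholder.
    if epithet != "sp.":
        return f"{genus}_{suffix}"

    # Case 2: no binomial name, so the strain identifier stands in
    # for the genus name.
    match = STRAIN.search(ncbi_name)
    if match is None:
        raise ValueError(f"no strain identifier found in: {ncbi_name!r}")
    return match.group(1)
```

For the names used in the FAQ, `placeholder_genus("Prevotella sp. oral taxon 473 str. F0040")` yields `F0040` and `placeholder_genus("Prevotella oralitaxus str. F0040")` yields `Prevotella_A`.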

Why do some genus and species names end with an alphabetic suffix?

Genus names ending with an alphabetic suffix indicate genera that are i) polyphyletic according to the current GTDB reference tree, or ii) subdivided based on taxonomic rank normalisation according to the current GTDB reference tree.

Species names end with an alphabetic suffix if the GTDB species cluster is (or was previously) associated with a species name, but the correct application of this name is ambiguous or the name is assigned to a different GTDB species cluster based on the presence of type material or via majority voting.

The lineage or species cluster containing the nomenclature type or, in case of species, satisfying the majority vote criteria retains the unsuffixed name and all other lineages/clusters are given alphabetic suffixes, indicating that they are placeholder names that need to be replaced in due course. A best effort is made to retain the same alphabetical suffix for a taxon between GTDB releases, but this is not guaranteed.
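Because the unsuffixed name stays with the lineage holding the type and the suffix itself may change between releases, a consumer may want to separate the base name from the suffix instead of treating the suffixed string as a stable identifier. The Python sketch below shows one way to do that. The helper names `split_suffix` and `group_by_base` are hypothetical, and accepting multi-letter suffixes is an assumption.

```python
import re
from collections import defaultdict

# Base name, underscore, then an uppercase alphabetic suffix.
# Multi-letter suffixes are allowed here as an assumption.
SUFFIXED = re.compile(r"^(?P<base>.+?)_(?P<suffix>[A-Z]+)$")


def split_suffix(name: str):
    """Split "Prevotella_A" into ("Prevotella", "A").
    Names without a suffix come back as (name, None)."""
    match = SUFFIXED.match(name)
    if match:
        return match.group("base"), match.group("suffix")
    return name, None


def group_by_base(names):
    """Collect the suffixes seen for each base name. A None entry
    marks the lineage that retains the unsuffixed name."""
    groups = defaultdict(list)
    for name in names:
        base, suffix = split_suffix(name)
        groups[base].append(suffix)
    return dict(groups)
```

Given the illustrative input `["Prevotella", "Prevotella_A", "Prevotella_B"]`, `group_by_base` returns `{"Prevotella": [None, "A", "B"]}`, which makes the placeholder lineages easy to tell apart from the one that keeps the name.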

Why do some family and higher rank names end with an alphabetic suffix?

Taxon names above the rank of genus that carry an alphabetic suffix indicate groups that fall into one of the following categories: i) groups that are not monophyletic in the GTDB reference tree, but for which there exists alternative evidence that they are monophyletic groups; ii) groups whose placement is unstable between releases.

A best effort is made to retain the same alphabetical suffix for a taxon between GTDB releases, but this is not guaranteed.

mdoering commented 3 years ago

https://github.com/gbif/name-parser/commit/121aed4a8ea26d7fe60b66b7ace15e5af9787ba5 and https://github.com/gbif/name-parser/commit/ad05197602c8bf01326b49c3ceffafc839290403