Open pragermh opened 4 years ago
About 2/3 of GTDB names are non-Linnean, like the phyla Desulfobacterota_B, UBA3054 or the species 01-FULL-45-15b sp001822655. This occurs on all ranks.
I wonder how many of these names are stable enough over time to be of any relevance for users. This also challenges name parsing.
@thomasstjerne do you know how these non Linnean names are generated? Are they stable across releases?
For COL to include GTDB it would require the name parser to detect OTU names reliably. It's more difficult than for BOLD or UNITE identifiers, but seems doable.
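To make the detection problem concrete, here is a minimal sketch of a heuristic that flags the kinds of non-Linnean names quoted in this issue. The three patterns are assumptions derived only from those examples (an uppercase suffix such as `_B`, an `sp` plus digits species placeholder, and a first word that is not a capitalised Latin-looking word); they are not GTDB's naming rules and not the actual COL name parser logic.

```python
import re

# Heuristic patterns, assumed from the examples in this thread only.
SUFFIX = re.compile(r"_[A-Z]+$")               # e.g. Desulfobacterota_B
SPECIES_PLACEHOLDER = re.compile(r"^sp\d+$")   # e.g. sp001822655
LATIN_WORD = re.compile(r"^[A-Z][a-z]+$")      # e.g. Binatus


def is_placeholder(name: str) -> bool:
    """Return True if the name looks like a non-Linnean (OTU-like) name."""
    tokens = name.split()
    if not tokens:
        return False
    # Any token carrying an uppercase suffix such as "_B"
    if any(SUFFIX.search(t) for t in tokens):
        return True
    # Species epithet of the form "sp" followed by digits
    if SPECIES_PLACEHOLDER.match(tokens[-1]):
        return True
    # First word is not a capitalised Latin-looking word (e.g. UBA3054)
    return LATIN_WORD.match(tokens[0]) is None


for n in ["Desulfobacterota_B", "UBA3054", "01-FULL-45-15b sp001822655", "Binatus soli"]:
    print(n, is_placeholder(n))
```

A real parser would need many more cases (authorships, ranks, hybrid markers), so this only illustrates why the task seems doable.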
thanks for the FAQ links, @pragermh !
COL is considering using the List of Prokaryotic names with Standing in Nomenclature (LPSN). It is a different taxonomy, but would at least be an up-to-date one.
LPSN is probably mostly (only?) suitable for formally described taxa (= a cultured type specimen exists). The major part of prokaryotic data in GBIF originates from metabarcoding studies based on the 16S region. These data types will have taxonomy assigned through some classification pipeline using a reference database such as GTDB or SILVA. When assessing the taxonomic diversity in these types of studies, it is important to understand the full diversity, not only the (small) fraction that has been formally described. I fear that using LPSN for GBIF indexing would result in a coarser taxonomic assignment for large parts of the prokaryotic data.
GTDB states that
> LPSN is used as the primary nomenclatural reference for establishing naming priorities and nomenclature types.
So LPSN probably more or less makes up the subset of formally described taxa in GTDB.
@dhobern @olafbanki @yroskov for your attention
Can't we use LPSN for the core COL content and the rest of GTDB to extend it?
I believe LPSN is nomenclature, whereas GTDB is a full taxonomy (a phylogeny based on assembled genomes). GTDB might therefore re-organise the classification of names in LPSN quite heavily in some cases, which could lead to quite a few conflicts.
LPSN has been reorganised to reflect consensus classification in a way that matches what we need for a full COL/GBIF species list based on published names. I am sure there will be some mismatches with molecular phylogeny, but surely that is no different from all other parts of the list.
Indeed that seems like a plausible and consistent way forward to me. Not much different to the situation with UNITE and BOLD really.
Neither a taxonomist nor prokaryote expert myself, but perhaps @erikrikarddaniel or @andand has something to add?
> Indeed that seems like a plausible and consistent way forward to me. Not much different to the situation with UNITE and BOLD really.

GTDB is quite different from BOLD or UNITE. The latter two use a short fragment (COI and ITS) to cluster sequences into "species-like" taxa (BINs, SHs). These BINs or SHs are then placed into a consensus classification that in most cases will be much like COL / GBIF. GTDB produces the full classification (phylogeny) and may sometimes (often?) deviate from the "consensus" classification, even at high levels such as phyla.
Here is an example: the species Binatus soli in GTDB, GBIF and LPSN.
Classification in GTDB: Bacteria > Desulfobacterota_B > Binatia > Binatales > Binataceae > Binatus > Binatus soli
Classification in LPSN: Bacteria > Binatota > Binatia > Binatales > Binataceae > Binatus > Binatus soli
The phylum Binatota is not present in the latest two versions of GTDB (history).
By extending LPSN with GTDB, I guess we would get both phyla, Desulfobacterota_B and Binatota, and the species Binatus soli would be placed in the phylum Binatota.
GTDB has 4 species in the genus Binatus; we should avoid these ending up in two homonymous genera in two phyla.
Would we then move the 3 species of Binatus not known to LPSN into the phylum Binatota (along with Binatus soli)?
And what about sibling genera of Binatus?
It might be safe to conclude that the GTDB phylum Desulfobacterota_B could simply be considered a synonym of the LPSN phylum Binatota, but I could imagine that there could potentially be many splits and merges, giving pro parte synonyms that would be less straightforward to deal with.
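As an illustration of how such conflicts could be spotted, here is a small sketch that compares the two Binatus soli lineages quoted above rank by rank. The rank list and the helper names `parse_lineage` and `diff_lineages` are hypothetical and chosen for this example only; neither GTDB nor LPSN provides them.

```python
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def parse_lineage(text):
    """Split an 'A > B > C' lineage string into a rank -> name mapping."""
    names = [part.strip() for part in text.split(">")]
    return dict(zip(RANKS, names))


def diff_lineages(a, b):
    """Return the ranks at which two lineages disagree."""
    return {
        rank: (a.get(rank), b.get(rank))
        for rank in RANKS
        if a.get(rank) != b.get(rank)
    }


gtdb = parse_lineage(
    "Bacteria > Desulfobacterota_B > Binatia > Binatales > Binataceae > Binatus > Binatus soli"
)
lpsn = parse_lineage(
    "Bacteria > Binatota > Binatia > Binatales > Binataceae > Binatus > Binatus soli"
)

print(diff_lineages(gtdb, lpsn))
# {'phylum': ('Desulfobacterota_B', 'Binatota')}
```

Running this over all shared species would only give candidate synonym pairs; pro parte cases would show up as one name mapping to several names on the other side and would still need manual review.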
> The phylum Binatota is not present in the latest two versions of GTDB (history). By extending LPSN with GTDB, I guess we would get both phyla, Desulfobacterota_B and Binatota, and the species Binatus soli would be placed in the phylum Binatota. GTDB has 4 species in the genus Binatus; we should avoid these ending up in two homonymous genera in two phyla. Would we then move the 3 species of Binatus not known to LPSN into the phylum Binatota (along with Binatus soli)? And what about sibling genera of Binatus?

Yes, the merging procedure we consider will prevent splitting genera into different classifications, even if far apart. By extending LPSN we would therefore use the Binatota placement for all Binatus species coming from GTDB. This is obviously problematic, but that problem is true for all the "extended" sources. I suppose the difference in Bacteria taxonomy is just much larger than anywhere else in the tree, so it has a bigger impact and is more visible.
> And what about sibling genera of Binatus?

Will they move to the other phylum along with Binatus?
Not if we do it as in the GBIF builds. But it might be a good idea if we can work out how to do this.
The current thinking does not touch the higher classification, at least not above family level. So if a source has a yet unplaced genus with a classification that is also not represented at all, the genus will end up in incertae sedis. If the kingdom is known, it will be placed under that kingdom, or wherever it snaps into the classification. In the Binatus example the family Binataceae is the same, so all siblings would also be placed there.
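The "snapping" described here can be sketched as follows. This is only an illustration of the idea under simplifying assumptions (names matched by exact string, no homonym or rank handling); the function `find_attachment` is hypothetical and is not the ChecklistBank or GBIF build implementation.

```python
def find_attachment(source_parents, base_taxa):
    """Pick where a genus from an extending source attaches in the base list.

    source_parents: parent names in the source, ordered from highest rank
                    down to the lowest (e.g. family).
    base_taxa:      set of taxon names already present in the base list.
    """
    # Walk upwards from the lowest parent and attach at the first match.
    for name in reversed(source_parents):
        if name in base_taxa:
            return name
    # Nothing in the source classification is known to the base list.
    return "incertae sedis"


# Base list following the LPSN classification quoted earlier in this thread.
base = {"Bacteria", "Binatota", "Binatia", "Binatales", "Binataceae"}

# GTDB parents of the genus Binatus.
gtdb_parents = ["Bacteria", "Desulfobacterota_B", "Binatia", "Binatales", "Binataceae"]

print(find_attachment(gtdb_parents, base))                          # Binataceae
print(find_attachment(["Bacteria", "Desulfobacterota_B"], base))    # Bacteria
print(find_attachment([], base))                                    # incertae sedis
```

Because the family Binataceae is shared, every GTDB genus in that family attaches under the LPSN placement without the higher GTDB ranks being consulted.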
Here are some metrics from the latest data we have in CLB:
GTDB: https://www.checklistbank.org/dataset/2214/imports/52
LPSN: https://www.checklistbank.org/dataset/2015/imports
I have created a first version of a ColDP LPSN dataset using their API here: https://www.dev.checklistbank.org/dataset/284997
The main problem is that I cannot find a way to access the classification they show on their site (parent link on top). I have contacted them and asked how to do that, let's see.
Also they made LPSN CC BY SA!
I have opened a dedicated issue for adding LPSN: https://github.com/CatalogueOfLife/data/issues/632
@DianRHR @camiplata I have discussed with @thomasstjerne and @tobiasgf how to best integrate GTDB into the XCOL. We want the classic LPSN in the base release of COL, but need to add GTDB OTU names as well in order to integrate with eDNA data. Our suggestion would be to add GTDB at genus level and below, and maybe also to include families, but nothing higher up. Could you look into any issues with that please?
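A minimal sketch of what "genus level and below, maybe families" could look like when reading a GTDB taxonomy row. It assumes each row is an accession, a tab, and a semicolon-separated lineage with rank prefixes (`d__`, `p__`, ..., `s__`); that layout, the accession `GENOME_1` and the function `keep_low_ranks` are assumptions for illustration and should be checked against the actual files.

```python
# Assumed rank prefixes in the lineage column.
PREFIX_TO_RANK = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def keep_low_ranks(line, include_family=True):
    """Return (accession, {rank: name}) keeping only genus and below,
    plus family when include_family is True."""
    accession, lineage = line.rstrip("\n").split("\t", 1)
    keep = {"genus", "species"}
    if include_family:
        keep.add("family")
    taxa = {}
    for part in lineage.split(";"):
        prefix, _, name = part.strip().partition("__")
        rank = PREFIX_TO_RANK.get(prefix)
        if rank in keep and name:
            taxa[rank] = name
    return accession, taxa


# Hypothetical row built from the Binatus soli lineage quoted in this thread.
row = ("GENOME_1\td__Bacteria;p__Desulfobacterota_B;c__Binatia;o__Binatales;"
       "f__Binataceae;g__Binatus;s__Binatus soli")

print(keep_low_ranks(row))
print(keep_low_ranks(row, include_family=False))
```

Dropping everything above family means the phylum conflict never enters COL from this source; the retained names are then placed using the base classification.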
Dataset title: Genome Taxonomy Database (GTDB)
Dataset contact & access:
https://gtdb.ecogenomic.org/
https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/ar122_taxonomy.tsv
https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/bac120_taxonomy.tsv
Taxonomic group & CoL sector: Prokaryotes: Archaea & Bacteria
Dataset description: The Genome Taxonomy Database (GTDB) is a standardised microbial taxonomy based on genome phylogeny, primarily funded by the Australian Research Council. GTDB currently includes ca. 30,000 prokaryote species clusters based on 195,000 genomes from isolates, metagenomes and single cells from RefSeq and GenBank.
References:
Parks, D.H., et al. (2020). "A complete domain-to-species taxonomy for Bacteria and Archaea." Nature Biotechnology, https://doi.org/10.1038/s41587-020-0501-8.
Parks, D.H., et al. (2018). "A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life." Nature Biotechnology, 36: 996-1004, https://doi.org/10.1038/nbt.4229.
Motivation for changing/adding: GTDB is a well-defined and fast-growing taxonomy of prokaryotes. Between release 89 (Aug 2019) and 95 (Jul 2020), the numbers of included genomes and species clusters both increased by ca. 30%, while 99.77% of existing genomes were still assigned to the same species clusters. When publishing a dataset of ca. 3000 Amplicon Sequence Variants (ASVs) of Baltic Sea microbes to the Swedish GBIF node (SBDI), we furthermore found that merging GTDB into the GBIF taxonomy backbone substantially increased the taxonomic resolution for our occurrences: the share of ASVs identified at genus and family level increased from 32 to 62% and from 55 to 77%, respectively. Since the database is based on draft genomes rather than a single taxonomic marker gene, it gives a lot of flexibility in terms of usage. For example, shotgun metagenome data can be annotated as well as 16S rRNA gene data.
Suggested by: Anders Andersson (KTH), Daniel Lundin (LnU) and Maria Prager (SU/KI), all associated with the Swedish Biodiversity Data Infrastructure (SBDI).