CatalogueOfLife / data

Repository for COL content

Genome Taxonomy Database (GTDB) for prokaryotes #202

Open pragermh opened 4 years ago

pragermh commented 4 years ago

Dataset title: Genome Taxonomy Database (GTDB)

Dataset contact & access:
https://gtdb.ecogenomic.org/
https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/ar122_taxonomy.tsv
https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/bac120_taxonomy.tsv

Taxonomic group & CoL sector: Prokaryotes (Archaea & Bacteria)

Dataset description: The Genome Taxonomy Database (GTDB) is a standardised microbial taxonomy based on genome phylogeny, primarily funded by the Australian Research Council. GTDB currently includes ca. 30,000 prokaryote species clusters based on 195,000 genomes from isolates, metagenomes and single cells from RefSeq and GenBank.

References:
Parks, D.H., et al. (2020). "A complete domain-to-species taxonomy for Bacteria and Archaea." Nature Biotechnology, https://doi.org/10.1038/s41587-020-0501-8.
Parks, D.H., et al. (2018). "A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life." Nature Biotechnology, 36: 996-1004, https://doi.org/10.1038/nbt.4229.

Motivation for changing/adding: GTDB is a well-defined and fast-growing taxonomy of prokaryotes. Between release 89 (Aug 2019) and release 95 (Jul 2020), the numbers of included genomes and species clusters both increased by ca. 30%, while 99.77% of existing genomes were still assigned to the same species clusters. When publishing a dataset of ca. 3000 Amplicon Sequence Variants (ASVs) of Baltic Sea microbes to the Swedish GBIF node (SBDI), we furthermore found that merging GTDB into the GBIF taxonomy backbone substantially increased the taxonomic resolution for our occurrences: the share of ASVs identified at genus and family level increased from 32 to 62% and from 55 to 77%, respectively. Since the database is based on draft genomes rather than a single taxonomic marker gene, it gives a lot of flexibility in terms of usage. For example, shotgun metagenome data can be annotated as well as 16S rRNA gene data.

Suggested by: Anders Andersson (KTH), Daniel Lundin (LnU) and Maria Prager (SU/KI), all associated with the Swedish Biodiversity Data Infrastructure (SBDI).

mdoering commented 4 years ago

About 2/3 of the GTDB names are non-Linnean, like the phyla Desulfobacterota_B and UBA3054 or the species 01-FULL-45-15b sp001822655. This occurs at all ranks:

[Screenshot 2020-11-18 at 06:59:51]

I wonder how many of these names are stable enough over time to be of any relevance for users. This also challenges name parsing.

mdoering commented 4 years ago

@thomasstjerne do you know how these non-Linnean names are generated? Are they stable across releases?

pragermh commented 4 years ago

There is some info on placeholder names and stability here.

mdoering commented 4 years ago

For COL to include GTDB it would require the name parser to detect OTU names reliably. It's more difficult than for BOLD or UNITE identifiers, but seems doable.
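To make the parsing problem concrete, here is a minimal sketch of a heuristic that flags placeholder names. It is not the COL name parser; the function name and both rules are assumptions derived only from the example names quoted earlier in this thread (a trailing `_B` style suffix, or any digits in a name part). Real GTDB data will likely need more rules.

```python
import re

# Hypothetical rule: an underscore followed by capital letters at the end of
# a name part, as in "Desulfobacterota_B".
ALPHA_SUFFIX = re.compile(r"_[A-Z]+$")


def looks_like_placeholder(name: str) -> bool:
    """Return True if any part of the name looks like an OTU-style placeholder."""
    for token in name.split():
        # Digits never occur in Latin scientific names, e.g. "UBA3054".
        if any(ch.isdigit() for ch in token):
            return True
        if ALPHA_SUFFIX.search(token):
            return True
    return False


for example in [
    "Desulfobacterota_B",
    "UBA3054",
    "01-FULL-45-15b sp001822655",
    "Binatus soli",
]:
    print(example, "->", looks_like_placeholder(example))
```

The first three names are flagged and Binatus soli is not. A rule set this small will misclassify some names, so it only illustrates why detection seems doable but harder than for BOLD or UNITE identifiers.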

mdoering commented 4 years ago

Thanks for the FAQ links, @pragermh!

mdoering commented 2 years ago

COL is considering using the List of Prokaryotic names with Standing in Nomenclature (LPSN). It is a different taxonomy, but would at least be an up-to-date one.

thomasstjerne commented 2 years ago

LPSN is probably mostly (only?) suitable for formally described taxa (i.e. a cultured type specimen exists). The major part of the prokaryotic data in GBIF originates from metabarcoding studies based on the 16S region. These data types will have taxonomy assigned through some classification pipeline using a reference database such as GTDB or SILVA. When assessing the taxonomic diversity in these types of studies, it is important to understand the full diversity, not only the (small) fraction that has been formally described. I fear that using LPSN for GBIF indexing would result in a coarser taxonomic assignment for large parts of the prokaryotic data.

GTDB states that

> LPSN is used as the primary nomenclatural reference for establishing naming priorities and nomenclature types.

So LPSN probably more or less makes up the subset of formally described taxa in GTDB.

mdoering commented 2 years ago

@dhobern @olafbanki @yroskov for your attention

dhobern commented 2 years ago

Can't we use LPSN for the core COL content and the rest of GTDB to extend it?

thomasstjerne commented 2 years ago

I believe LPSN is nomenclature, whereas GTDB is a full taxonomy (a phylogeny based on assembled genomes). GTDB might therefore reorganise the classification of names in LPSN quite heavily in some cases, which could lead to quite a few conflicts.

dhobern commented 2 years ago

LPSN has been reorganised to reflect consensus classification in a way that matches what we need for a full COL/GBIF species list based on published names. I am sure there will be some mismatches with molecular phylogeny, but surely that is no different from all other parts of the list.

mdoering commented 2 years ago

Indeed that seems like a plausible and consistent way forward to me. Not much different to the situation with UNITE and BOLD really.

pragermh commented 2 years ago

I am neither a taxonomist nor a prokaryote expert myself, but perhaps @erikrikarddaniel or @andand has something to add?

thomasstjerne commented 2 years ago

> Indeed that seems like a plausible and consistent way forward to me. Not much different to the situation with UNITE and BOLD really.

GTDB is quite different from BOLD or UNITE. The latter two use a short fragment (COI and ITS, respectively) to cluster sequences into "species-like" taxa (BINs and SHs). These BINs or SHs are then placed into a consensus classification that in most cases will be much like COL / GBIF. GTDB produces the full classification (phylogeny) and may sometimes (often?) deviate from the "consensus" classification, even at high levels such as phyla.

Here is an example: the species Binatus soli in GTDB, GBIF and LPSN.

Classification in GTDB: Bacteria > Desulfobacterota_B > Binatia > Binatales > Binataceae > Binatus > Binatus soli
Classification in LPSN: Bacteria > Binatota > Binatia > Binatales > Binataceae > Binatus > Binatus soli

The phylum Binatota is not present in the latest two versions of GTDB (history). By extending LPSN with GTDB, I guess we would get both phyla, Desulfobacterota_B and Binatota, and the species Binatus soli would be placed in the phylum Binatota. GTDB has 4 species in the genus Binatus, and we should avoid these ending up in two homonymous genera in two phyla.
Would we then move the 3 species of Binatus not known to LPSN into the phylum Binatota (along with Binatus soli)? And what about sibling genera of Binatus?

It might be safe to conclude that the GTDB phylum Desulfobacterota_B could simply be considered a synonym of the LPSN phylum Binatota, but I imagine there could be many splits and merges, giving pro parte synonyms that would be less straightforward to deal with.
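The two Binatus soli lineages quoted in this comment can be compared rank by rank to show exactly where GTDB and LPSN disagree. This is a minimal sketch; the rank labels in `RANKS` and both function names are assumptions added for illustration, and the lineage strings are copied from the comment.

```python
# Assumed rank labels for the seven names in each lineage string.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def parse_lineage(text: str) -> dict[str, str]:
    """Split an 'A > B > C' lineage string into a rank -> name mapping."""
    names = [part.strip() for part in text.split(">")]
    if len(names) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} names, got {len(names)}")
    return dict(zip(RANKS, names))


def differing_ranks(a: dict[str, str], b: dict[str, str]) -> list[str]:
    """Return the ranks at which two parsed lineages carry different names."""
    return [rank for rank in RANKS if a[rank] != b[rank]]


gtdb = parse_lineage(
    "Bacteria > Desulfobacterota_B > Binatia > Binatales > Binataceae > Binatus > Binatus soli"
)
lpsn = parse_lineage(
    "Bacteria > Binatota > Binatia > Binatales > Binataceae > Binatus > Binatus soli"
)

print(differing_ranks(gtdb, lpsn))  # ['phylum']
```

For this species only the phylum differs. Running the same comparison over all names shared by both sources would give a rough idea of how many splits and merges to expect.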

mdoering commented 2 years ago

> The phylum Binatota is not present in the latest two versions of GTDB (history). By extending LPSN with GTDB, I guess we would get both phyla, Desulfobacterota_B and Binatota, and the species Binatus soli would be placed in the phylum Binatota. GTDB has 4 species in the genus Binatus, and we should avoid these ending up in two homonymous genera in two phyla. Would we then move the 3 species of Binatus not known to LPSN into the phylum Binatota (along with Binatus soli)? And what about sibling genera of Binatus?

Yes, the merging procedure we consider will prevent splitting genera into different classifications, even if far apart. By extending LPSN we would therefore use the Binatota placement for all Binatus species coming from GTDB. This is obviously problematic, but that problem applies to all the "extended" sources. I suppose the difference in Bacteria taxonomy is just much larger than anywhere else in the tree, so it has a bigger impact and is more visible.

thomasstjerne commented 2 years ago

> And what about sibling genera of Binatus?

Will they move to the other phylum along with Binatus?

mdoering commented 2 years ago

Not if we do it as in the GBIF builds. But it might be a good idea if we can work out how to do this. The current thinking does not touch the higher classification, at least not above family level. So if a source has an as yet unplaced genus whose classification is not represented at all, the genus will end up in incertae sedis. If the kingdom is known, it is placed under that kingdom, or under whatever lower taxon snaps into the classification. In the Binatus example the family Binataceae is the same, so all siblings would also be placed there.
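The placement rule described here can be sketched as "attach at the lowest ancestor that the base classification already knows". This is only an illustration under simplifying assumptions, not the actual GBIF or ChecklistBank merge code: taxa are matched by name alone (no ranks, authorships or homonym handling), and `attach_point`, `BASE_TAXA` and `INCERTAE_SEDIS` are hypothetical names.

```python
INCERTAE_SEDIS = "incertae sedis"

# Taxa assumed to exist in the base (LPSN) classification, taken from the
# Binatus soli lineage quoted earlier in the thread.
BASE_TAXA = {"Bacteria", "Binatota", "Binatia", "Binatales", "Binataceae", "Binatus"}


def attach_point(source_ancestors: list[str], base_taxa: set[str]) -> str:
    """Return the lowest source ancestor that also exists in the base.

    source_ancestors is ordered from the highest rank down to the direct
    parent. If nothing matches, the taxon goes to incertae sedis.
    """
    for name in reversed(source_ancestors):
        if name in base_taxa:
            return name
    return INCERTAE_SEDIS


gtdb_parents = ["Bacteria", "Desulfobacterota_B", "Binatia", "Binatales", "Binataceae"]

# A further Binatus species from GTDB snaps onto the existing genus.
print(attach_point(gtdb_parents + ["Binatus"], BASE_TAXA))  # Binatus
# A sibling genus unknown to the base snaps onto the shared family.
print(attach_point(gtdb_parents, BASE_TAXA))  # Binataceae
# Only the kingdom is known to the base.
print(attach_point(["Bacteria", "Desulfobacterota_B"], BASE_TAXA))  # Bacteria
# Nothing in the source classification is represented.
print(attach_point(["Desulfobacterota_B"], BASE_TAXA))  # incertae sedis
```

Because the GTDB phylum Desulfobacterota_B never matches, it is silently skipped and everything under Binataceae inherits the LPSN phylum Binatota, which is the behaviour discussed above.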

mdoering commented 1 year ago

Here are some metrics from the latest data we have in CLB:
GTDB: https://www.checklistbank.org/dataset/2214/imports/52
LPSN: https://www.checklistbank.org/dataset/2015/imports

mdoering commented 9 months ago

I have created a first version of a ColDP LPSN dataset using their API here: https://www.dev.checklistbank.org/dataset/284997

The main problem with that is that I cannot find a way to access the classification they do show on their site (parent link on top). I have contacted them and asked how to do that, let's see.

Also they made LPSN CC BY SA!

mdoering commented 8 months ago

I have opened a dedicated issue for adding LPSN: https://github.com/CatalogueOfLife/data/issues/632

mdoering commented 8 months ago

@DianRHR @camiplata I have discussed with @thomasstjerne and @tobiasgf how to best integrate GTDB into the XCOL. We want the classic LPSN in the base release of COL, but need to add GTDB OTU names on top of it to integrate with eDNA data. Our suggestion would be to add GTDB at genus level and below, and maybe also to include families, but nothing higher up. Could you look into any issues with that please?
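As a rough illustration of that suggestion, the sketch below keeps only the family, genus and species parts of one GTDB taxonomy line. It assumes the taxonomy TSV files hold a genome accession, a tab, and a semicolon-separated lineage with rank prefixes such as `f__`, `g__` and `s__`; please check that against the actual files. The accession `GENOME_0001`, the function name and the `include_family` switch are hypothetical.

```python
# Rank prefixes to keep; everything above family is dropped.
KEPT_PREFIXES = {"f": "family", "g": "genus", "s": "species"}

# Hypothetical sample line built from the Binatus soli lineage in this thread.
SAMPLE = (
    "GENOME_0001\t"
    "d__Bacteria;p__Desulfobacterota_B;c__Binatia;o__Binatales;"
    "f__Binataceae;g__Binatus;s__Binatus soli"
)


def lower_ranks(tsv_line: str, include_family: bool = True) -> dict[str, str]:
    """Return the family/genus/species names from one taxonomy line."""
    _accession, lineage = tsv_line.rstrip("\n").split("\t", 1)
    kept = {}
    for part in lineage.split(";"):
        prefix, _, name = part.strip().partition("__")
        rank = KEPT_PREFIXES.get(prefix)
        if rank is None or not name:
            continue  # higher rank, or an empty name
        if rank == "family" and not include_family:
            continue
        kept[rank] = name
    return kept


print(lower_ranks(SAMPLE))
print(lower_ranks(SAMPLE, include_family=False))
```

With `include_family=False` only the genus and species survive, which corresponds to the stricter variant of the proposal.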