biocore / q2-greengenes2

A QIIME 2 plugin for interaction with the Greengenes2 database
BSD 3-Clause "New" or "Revised" License

Meaning of 6-digit numeric suffix in classification results #9

ilnamkang closed this issue 1 year ago

ilnamkang commented 1 year ago

Hi,

I tried using one of the files from the FTP site with the "qiime feature-classifier classify-sklearn" command.

The file I downloaded from the ftp site is "2022.10.backbone.full-length.nb.qza".

This is the qiime command I tried.
-----
qiime feature-classifier classify-sklearn --i-classifier 2022.10.backbone.full-length.nb.qza --i-reads test.qza --o-classification test_taxonomy.qza --p-n-jobs 72
-----

Below is an excerpt of the taxonomy strings in the result.
-----
d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Burkholderiales_592522; f__Burkholderiaceae_A_592522
d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Burkholderiales_592522; f__Burkholderiaceae_A_592522; g__Ideonella_A_591966; s__
d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Burkholderiales_592522; f__Burkholderiaceae_A_592522; g__Limnohabitans_A; s__Limnohabitans_A curvus
d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Burkholderiales_592522; f__Burkholderiaceae_A_592522; g__Limnohabitans_A; s__Limnohabitans_A curvus
d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Burkholderiales_592522; f__Burkholderiaceae_A_592522; g__Limnohabitans_A; s__Limnohabitans_A sp001412575
d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Burkholderiales_592522; f__Burkholderiaceae_A_592522; g__Limnohabitans_A; s__Limnohabitans_A sp001412575
d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Burkholderiales_592522; f__Burkholderiaceae_A_592522; g__Ottowia_586836
d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Burkholderiales_592524; f__Burkholderiaceae_A_580492; g__Polynucleobacter
d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Burkholderiales_592524; f__Burkholderiaceae_A_580492; g__Polynucleobacter
d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Burkholderiales_597441; f__Methylophilaceae; g__RFPI01; s__RFPI01 sp009926205
-----

I'd like to know the meaning of the 6-digit numeric suffix attached to some taxon names (for example, the "592522" in "o__Burkholderiales_592522").

Thanks.

Ilnam

wasade commented 1 year ago

Hi @ilnamkang, the numeric suffix is a unique label that differentiates clades in the phylogeny. In this case, "o__Burkholderiales" is polyphyletic, so to produce a unique lineage string based on the phylogeny it is necessary to make the label unique.
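For anyone who needs to work with these labels downstream, here is a minimal sketch of separating the numeric clade identifier from a taxon label. It is not part of q2-greengenes2; the helper names `split_label` and `strip_clade_ids` are hypothetical, and it assumes the identifier is always a trailing underscore followed only by digits, as in the excerpt above.

```python
import re

# Assumption: the clade identifier is a trailing "_" followed only by digits.
CLADE_ID = re.compile(r"_(\d+)$")


def split_label(label):
    """Split e.g. 'o__Burkholderiales_592522' into (rank, name, clade_id)."""
    rank, sep, name = label.strip().partition("__")
    if not sep:  # no rank prefix present; treat the whole label as the name
        rank, name = "", label.strip()
    match = CLADE_ID.search(name)
    if match is None:
        return rank, name, None
    return rank, name[:match.start()], match.group(1)


def strip_clade_ids(lineage):
    """Remove numeric clade identifiers from every rank of a lineage string."""
    return "; ".join(CLADE_ID.sub("", part.strip()) for part in lineage.split(";"))


print(split_label("f__Burkholderiaceae_A_592522"))
# ('f', 'Burkholderiaceae_A', '592522')
print(strip_clade_ids("d__Bacteria; o__Burkholderiales_592522; g__Ottowia_586836"))
# d__Bacteria; o__Burkholderiales; g__Ottowia
```

Note that stripping the identifiers merges clades that the phylogeny keeps separate, so it is probably only appropriate for display or for comparison against other taxonomies.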

ilnamkang commented 1 year ago

Thank you for your clear explanation.

May I ask one more question?

As far as I know, (nearly) all taxa in the GTDB are monophyletic in the genome-based trees constructed by the GTDB team.

Is it possible, and perhaps not unusual, for taxa that are monophyletic in the GTDB to appear polyphyletic in the new Greengenes2 phylogeny?

Thanks.

Ilnam

wasade commented 1 year ago

That's a good question! The records may be monophyletic in GTDB, but this is a very different phylogeny and one that includes a broader range of organisms. We do preserve the polyphyletic suffixes that originate from GTDB, but in some instances we found it necessary to introduce our own to ensure that taxon labels are unique.
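To illustrate the two kinds of suffix, the sketch below separates them for a single label. It rests on assumptions not confirmed in this thread: that suffixes carried over from GTDB are uppercase letters (such as "_A") and that the ones introduced by Greengenes2 are purely numeric. The function name `suffixes` is hypothetical.

```python
import re

# Assumptions: a GTDB-style suffix is "_" plus uppercase letters, and a
# Greengenes2 suffix is "_" plus digits, both at the end of the name.
LABEL = re.compile(r"^(?P<base>.*?)(?:_(?P<gtdb>[A-Z]+))?(?:_(?P<gg2>\d+))?$")


def suffixes(label):
    """Return (base_name, gtdb_suffix, gg2_suffix) for one rank label."""
    name = label.strip().split("__", 1)[-1]  # drop the rank prefix if present
    match = LABEL.match(name)
    return match.group("base"), match.group("gtdb"), match.group("gg2")


print(suffixes("f__Burkholderiaceae_A_592522"))  # ('Burkholderiaceae', 'A', '592522')
print(suffixes("o__Burkholderiales_592522"))     # ('Burkholderiales', None, '592522')
print(suffixes("g__Limnohabitans_A"))            # ('Limnohabitans', 'A', None)
```

Species labels contain a space, and a genus-level suffix inside them (as in "s__Limnohabitans_A curvus") is left in the base name by this pattern, since only suffixes at the end of the label are examined.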

ilnamkang commented 1 year ago

Thank you very much for your clear explanation.

wasade commented 1 year ago

You're welcome!