HRGV / phyloFlash

phyloFlash - A pipeline to rapidly reconstruct the SSU rRNAs and explore phylogenetic composition of an illumina (meta)genomic dataset.
GNU General Public License v3.0

Taxonomy for eukaryotic sequences #124

Open pm-leung opened 3 years ago

pm-leung commented 3 years ago

Hi phyloFlash team, thanks for developing this awesome tool!

My question is about the taxonomy assignment of NTUs for eukaryotic sequences. I know that eukaryotes have many more taxonomic ranks than Bacteria and Archaea. However, phyloFlash output displays a maximum of 7 levels, which gives much less resolution for the eukaryotic community. This causes some limitations, for example when filtering out contaminating sequences such as human SSU after a phyloFlash run. Is there a direct way in phyloFlash to display more taxonomic ranks for eukaryotic sequences?

All the best, Bob

kbseah commented 3 years ago

Hello Bob,

Thanks for your message. You're right, this is a major limitation of the tool as it is currently designed. At the moment it may be possible to hack around it by setting the -taxlevel parameter to a number greater than 7, but the output will not be displayed properly.

I'm afraid that any fix would require a substantial rewrite because the current software is quite tightly linked to the SILVA database and its assumptions, but thanks for bringing this up, it's good to know how people are using the tool in ways that we don't always expect.

Best regards, Brandon

pm-leung commented 3 years ago

Hi Brandon,

Thank you so much for the reply! I tried the hack of setting a higher -taxlevel, and the full taxonomy can now be found in phyloFlash.NTUabundance.csv (though phyloFlash.NTUfull_abundance.csv still displays a maximum of 7 levels, which is a bit at odds with its file name). The number of taxonomic ranks for different eukaryotes is highly uneven: some insect taxa have up to a whopping 19 levels, and human has 18. This makes it even more painful to present the data properly.
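One way to cope with the uneven depth after the run is to truncate every taxonomy path to a common number of ranks and sum the counts, or to pad short paths so they fit a fixed number of columns. The following is only a minimal sketch: it assumes the abundance CSV has two columns (a semicolon-delimited taxonomy path and a read count), and the function names and example taxa are made up for illustration and are not part of phyloFlash.

```python
import csv
import io
from collections import Counter


def read_ntu_rows(handle):
    """Yield (taxonomy_path, count) pairs from a two-column CSV.

    Assumed layout: column 1 is a ';'-delimited taxonomy path, column 2 is
    an integer read count. Rows whose second column is not an integer
    (for example a header line) are skipped.
    """
    for row in csv.reader(handle):
        if len(row) < 2:
            continue
        try:
            count = int(row[1])
        except ValueError:
            continue
        yield row[0], count


def collapse_to_depth(rows, depth):
    """Truncate each path to at most `depth` ranks and sum the counts."""
    totals = Counter()
    for path, count in rows:
        ranks = [r.strip() for r in path.split(";") if r.strip()]
        totals[";".join(ranks[:depth])] += count
    return totals


def pad_path(path, depth, filler="NA"):
    """Return exactly `depth` ranks, truncating long paths and padding short ones."""
    ranks = [r.strip() for r in path.split(";") if r.strip()]
    return ranks[:depth] + [filler] * max(0, depth - len(ranks))


# Hypothetical example data, not real phyloFlash output.
example = io.StringIO(
    "Bacteria;Proteobacteria;Gammaproteobacteria,10\n"
    "Eukaryota;Opisthokonta;Holozoa;Metazoa;Chordata,5\n"
    "Eukaryota;Opisthokonta;Holozoa;Metazoa;Arthropoda,3\n"
)
for taxon, total in collapse_to_depth(read_ntu_rows(example), 4).items():
    print(taxon, total)
```

Truncating loses resolution below the chosen depth, so the depth may need to differ between prokaryotes and eukaryotes.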

I absolutely understand the difficulty of directly fixing how the tool handles eukaryotic taxonomy. It would be great if there were an option in the tool to present the taxonomy or summary of prokaryote SSU and eukaryote SSU separately.
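Until such an option exists, the split can be approximated outside the tool. This is a rough sketch, assuming each taxonomy path is semicolon-delimited and starts with the domain name (Bacteria, Archaea, or Eukaryota); the function names, the `exclude_term` filter, and the example rows are hypothetical and not part of phyloFlash.

```python
from collections import defaultdict

# Assumption: the first rank of every taxonomy path is the domain.
PROKARYOTE_DOMAINS = {"Bacteria", "Archaea"}


def split_by_domain(rows):
    """Group (taxonomy_path, count) pairs into prokaryote / eukaryote / other."""
    groups = defaultdict(list)
    for path, count in rows:
        domain = path.split(";", 1)[0].strip()
        if domain in PROKARYOTE_DOMAINS:
            key = "prokaryote"
        elif domain == "Eukaryota":
            key = "eukaryote"
        else:
            key = "other"
        groups[key].append((path, count))
    return dict(groups)


def drop_matching(rows, exclude_term):
    """Remove rows whose path contains `exclude_term` as one of its ranks."""
    return [
        (path, count)
        for path, count in rows
        if exclude_term not in [r.strip() for r in path.split(";")]
    ]


# Hypothetical example rows, not real phyloFlash output.
example_rows = [
    ("Bacteria;Proteobacteria;Gammaproteobacteria", 10),
    ("Archaea;Euryarchaeota", 2),
    ("Eukaryota;Opisthokonta;Metazoa;Mammalia", 7),
    ("Eukaryota;Opisthokonta;Metazoa;Arthropoda", 3),
]
groups = split_by_domain(example_rows)
eukaryotes_kept = drop_matching(groups["eukaryote"], "Mammalia")
print(len(groups["prokaryote"]), len(eukaryotes_kept))
```

Matching on a whole rank instead of a substring avoids accidentally dropping taxa whose names merely contain the excluded term.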

Many thanks, Bob

kbseah commented 3 years ago

Hello Bob,

Yes, the eukaryotic taxonomy definitely has more ranks than the prokaryotic one, and the main limitation, as you have found, is that the paths do not all go to the same depth. phyloFlash depends on the SILVA reference database, and that project has to date been more focused on prokaryotes (see page).

Your suggestion of splitting the prokaryote and eukaryote taxonomy could be a good workaround in the meantime, thanks! We'll have to think about how to implement it.

Best regards, Brandon