HRGV / phyloFlash

phyloFlash - A pipeline to rapidly reconstruct the SSU rRNAs and explore phylogenetic composition of an Illumina (meta)genomic dataset.
GNU General Public License v3.0

Incomplete taxonomic paths in .phyloFlash.NTUfull_abundance.csv output #184

Open chassenr opened 1 year ago

chassenr commented 1 year ago

Hi @HRGV and @kbseah ,

I have been using phyloFlash to explore the eukaryotic component of some metagenomes and noticed that the taxonomic paths in the .phyloFlash.NTUfull_abundance.csv output table are all truncated to 7 levels, which is not sufficient for eukaryotes. In my particular example, I am interested in the taxonomic composition of Chytridiomycota (fungi), but the taxonomic path is not resolved beyond this level (phylum). Is there a quick fix for this that I can implement myself? Are you planning to change this in upcoming phyloFlash versions? I know that eukaryotic taxonomic paths are a nightmare (especially if you want to align them with prokaryotic ones), but maybe the tax_slv_ssu_138.1.txt file will be helpful to pick a corresponding set of taxonomic ranks for both prokaryotes and eukaryotes in the output?

Thanks!

Cheers, Christiane

kbseah commented 1 year ago

Hi Christiane, thanks for pointing this out. As you note, this is a tricky issue because eukaryotic taxonomic paths are longer and have inconsistent lengths in the SILVA taxonomy (and in the NCBI taxonomy too).

One possibility I see is to use the PR2 taxonomy paths instead, which are standardized to 9 levels: https://pr2database.github.io/pr2database/articles/pr2_02A_silva.html

I haven't checked though what fraction of the SILVA eukaryotic sequences also appear in PR2. Some groups may not be represented in PR2 because they rely on expert curation for specific taxonomic groups.

Can't make any promises about when a new phyloFlash version will come out. As a stop-gap we could work on a SILVA database with modified taxonomy paths. Will keep this in mind.

chassenr commented 1 year ago

Hi @kbseah, thanks for your fast reply. Is there maybe a way to work with the existing phyloFlash output and just parse the SAM file differently to create the NTU table with the complete paths (independent of phyloFlash)? Just as a quick fix? I tried to identify the corresponding code in the Perl scripts, but since I am not a Perl person that was a bit difficult for me...

kbseah commented 1 year ago

Hi Christiane, I think the best option for now is to simply parse the SAM file. It contains the SILVA accessions and header lines, which include the taxonomy paths, which you can then summarize at the level you wish.
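
A minimal Python sketch of that approach, under some assumptions that should be checked against your own SAM file. It assumes the reference name (column 3 of each alignment line, or the `SN:` field of an `@SQ` header line) has the form `<accession> <semicolon-delimited taxonomy path>`. The function names `split_ref` and `count_taxa` are made up for this example and are not part of phyloFlash. It also counts each aligned read separately, so the totals are not expected to match the NTU table exactly if phyloFlash counts reads differently.

```python
from collections import Counter


def split_ref(name):
    """Split '<accession> <tax;path>' into (accession, path).

    Assumed header layout; path is '' when the name has no space.
    """
    acc, _, path = name.partition(" ")
    return acc, path.strip()


def count_taxa(sam_lines, level=None):
    """Count primary alignments per taxonomy path.

    sam_lines: iterable of SAM text lines (e.g. an open file handle).
    level: keep only the first `level` ranks of each path; None keeps
           the complete path.
    """
    sq_paths = {}       # accession -> path, collected from @SQ header lines
    counts = Counter()
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            if line.startswith("@SQ"):
                for field in line.split("\t")[1:]:
                    if field.startswith("SN:"):
                        acc, path = split_ref(field[3:])
                        if path:
                            sq_paths[acc] = path
            continue
        cols = line.split("\t")
        if len(cols) < 3 or cols[2] == "*":
            continue                      # malformed line or unmapped read
        if int(cols[1]) & 0x900:
            continue                      # skip secondary/supplementary alignments
        acc, path = split_ref(cols[2])
        path = path or sq_paths.get(acc, "")
        if not path:
            continue                      # no taxonomy available for this reference
        ranks = [r for r in path.split(";") if r]
        if level is not None:
            ranks = ranks[:level]
        counts[";".join(ranks)] += 1
    return counts


# Example use (the file name is a placeholder):
# with open("path/to/library.sam") as handle:
#     for path, n in count_taxa(handle).most_common():
#         print(n, path, sep="\t")
```

Leaving `level` unset keeps the full path, which is what matters for the eukaryotic lineages; passing for example `level=7` truncates every path to its first seven ranks.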