HRGV / phyloFlash

phyloFlash - A pipeline to rapidly reconstruct the SSU rRNAs and explore the phylogenetic composition of an Illumina (meta)genomic dataset.
GNU General Public License v3.0

bbmap -id #151

Closed ghost closed 2 years ago

ghost commented 3 years ago

Hi all,

I am running phyloFlash as follows:

phyloFlash.pl -read1 S_1.fastq.gz -read2 /S_2.fastq.gz -readlength 100 -lib S_def_6 -dbhome phyloflash/db/138.1 -taxlevel 6 -html -treemap -zip -log -poscov

I have noticed that a large number of the mapped reads do not match my expectations for this sample. I submitted several of these read sequences to the SILVA ACT service to check their taxonomy: most frequently, SILVA ACT returns an unclassified read with %ID at around 40-50%, whilst phyloFlash keeps these reads with bbmap -id 70-90%.

Do you have any idea what is happening?

Thanks!

kbseah commented 3 years ago

Hello, SILVA ACT and phyloFlash use different aligners: SINA and BBMap, respectively. In theory SINA should be more sensitive, but BBMap would be much faster. An identity of 40% is pretty low, though, which suggests to me that these reads might be artefacts and not true rRNA sequences.

A few possibilities:

Hope this helps!

ghost commented 3 years ago

Hi! Thanks for the super quick answer!

So, replying to your points:

Do you have any suggestions on how to best deal with this?

kbseah commented 3 years ago

To check if I understand correctly, was it the unassembled reads from your data that you tried to classify with SILVA ACT, or the reference sequences that they mapped to? Or both?

If there is a full-length sequence assembled with phyloFlash, and it has BLAST hits to more than one reference genome sequence, I'd be inclined to say that it's probably not a mapping artefact. From what you said above, you were not expecting to find Streptococcus in the sample, is that right? Have you tried to assemble the metagenome, too?

SINA uses a last-common-ancestor method to classify sequences, based on the taxonomy of the best hits. If there is a misannotation of a reference sequence in the database, it's possible that the classification gets thrown off as a result. There is some curation and filtering of what goes into the SILVA database, but sometimes things fall through the cracks.
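
To illustrate why a single misannotated reference can throw off the result, here is a minimal sketch of a strict last-common-ancestor classification over taxonomy strings. It is not SINA's actual implementation: the function name `lca_taxonomy`, the semicolon-separated lineage format, and the example lineages are assumptions made up for illustration, and SINA may apply additional rules (for example, tolerating a minority of disagreeing hits) that are not modelled here.

```python
def lca_taxonomy(lineages):
    """Return the ranks shared by ALL lineages, from the root downwards.

    Each lineage is a semicolon-separated taxonomy path (assumed format).
    This is a strict LCA: one disagreeing hit truncates the result.
    """
    if not lineages:
        return []
    # Split each path into its ranks, ignoring a trailing semicolon.
    split = [lineage.strip(";").split(";") for lineage in lineages]
    shared = []
    # zip stops at the shortest lineage, so ranks are compared level by level.
    for ranks in zip(*split):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical best hits for one query sequence (illustrative lineages only).
consistent_hits = [
    "Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus",
    "Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus",
    "Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Lactococcus",
]
# Same hits plus one reference carrying a wrong (misannotated) lineage.
with_misannotation = consistent_hits + [
    "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales",
]

print(";".join(lca_taxonomy(consistent_hits)))
print(";".join(lca_taxonomy(with_misannotation)))
```

With the consistent hits, the classification is resolved down to family level; adding the single misannotated hit collapses it to the domain, which is the kind of effect described above.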