bbmap -id - Githubissues

ghost commented 3 years ago

Hi all,

I am running as such:

phyloFlash.pl -read1 S_1.fastq.gz -read2 /S_2.fastq.gz -readlength 100 -lib S_def_6 -dbhome phyloflash/db/138.1 -taxlevel 6 -html -treemap -zip -log -poscov

I have noticed that a large number of the mapped reads do not match my expectations for this sample. I ran several of these read sequences through the SILVA ACT service to check their taxonomy: most often SILVA ACT returns the read as unclassified with %ID at around 40-50%, whilst phyloFlash keeps these reads at a bbmap -id of 70-90%.

Do you have any idea what is happening?

Thanks!

kbseah commented 3 years ago

Hello, SILVA ACT and phyloFlash use different aligners: SINA and BBMap respectively. In theory SINA should be more sensitive, while BBMap is much faster. 40% is pretty low, though, which suggests to me that these reads might be artefacts rather than true rRNA sequences.

A few possibilities:

- The reads could be "recruiting" to one or a few outlier sequences in the database, e.g. if they contain low-complexity sequence that was not properly filtered out. You could check which reference sequence these "unclassified" reads map to with the phyloFlash pipeline.
- Also check what the positional coverage histogram looks like. If there are artefact reads being pulled in, they tend to map to specific positions in the reference, rather than evenly across the SSU rRNA gene as they should. Uneven coverage with one or two big spikes at certain positions is a diagnostic sign of this.
- Did you try assembling the extracted SSU sequences (e.g. using the SPAdes assembly option)? If the novel sequence is a true new outlier with high coverage, then you might be able to assemble a full-length sequence and analyze that phylogenetically. With mapping-only approaches there is of course more uncertainty about phylogenetic placement.
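
For the first point, a minimal sketch of how one might tally reads per reference sequence, assuming you have the read mapping in SAM format (which file holds it in your run is not specified here). The function name and the toy records are made up for illustration; only the standard SAM columns and flag bits are relied on.

```python
from collections import Counter


def count_reads_per_reference(sam_lines):
    """Count primary, mapped alignments per reference name.

    `sam_lines` is any iterable of SAM-format text lines
    (an open file handle works).
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):      # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:          # not a complete alignment record
            continue
        flag = int(fields[1])
        rname = fields[2]
        if flag & 0x4 or rname == "*":  # read is unmapped
            continue
        if flag & 0x900:              # secondary or supplementary alignment
            continue
        counts[rname] += 1
    return counts


# Toy records (hypothetical read and reference names, not real data).
toy_sam = [
    "@SQ\tSN:refA\tLN:1500",
    "r1\t0\trefA\t10\t40\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t16\trefA\t700\t40\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r3\t0\trefB\t55\t40\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
]

for ref, n in count_reads_per_reference(toy_sam).most_common():
    print(ref, n)
```

If a handful of references account for most of the suspicious reads, those are the ones worth inspecting by hand.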

Hope this helps!

ghost commented 3 years ago

Hi! Thanks for the super quick answer!

So, replying to your points:

- I checked, and indeed there are (at least) 4 reference seqs that consistently come up as unclassified with SINA, but classified with bbmap:
  CFFF01000203.4900.6512 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae
  CVOF01000044.509.1766 Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus pneumoniae
  CKQU01000078.7468.8814 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae
  AWPA01000003.163001.164539 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pyogenes
- The positional coverage histogram is skewed towards the first half of the sequence, but doesn't show one or two dramatic peaks.
- 1 SSU sequence with a Streptococcus taxon assignment did assemble, blasting to "Streptococcus pyogenes strain MGAS2221 chromosome, complete genome".
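
A rough way to put numbers on "skewed but no dramatic peaks" is sketched below. It assumes the coverage is available as a plain list of per-position read depths; the function name, the two summary statistics, and the toy profiles are illustrative choices, not something phyloFlash reports in this form.

```python
from statistics import median


def summarize_coverage(coverage):
    """Summarize a per-position coverage profile.

    Returns the fraction of total coverage in the first half of the
    reference and the ratio of the highest position to the median.
    A very large peak-to-median ratio hints at a spike; a first-half
    fraction far from 0.5 hints at skew.
    """
    if not coverage:
        raise ValueError("coverage must not be empty")
    total = sum(coverage)
    if total == 0:
        return {"first_half_fraction": 0.0, "peak_to_median": 0.0}
    half = len(coverage) // 2
    med = median(coverage)
    peak = max(coverage)
    return {
        "first_half_fraction": sum(coverage[:half]) / total,
        # A median of zero with non-zero peak means coverage is confined
        # to less than half of the positions.
        "peak_to_median": peak / med if med > 0 else float("inf"),
    }


# Toy profiles (made-up numbers): one skewed, one with a single spike.
skewed = [10] * 50 + [5] * 50
spiked = [5] * 100
spiked[40] = 500

print(summarize_coverage(skewed))
print(summarize_coverage(spiked))
```

What counts as "too spiky" is a judgment call; comparing against a reference that you trust in the same sample is probably more informative than any fixed cutoff.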

Do you have any suggestions on how to best deal with this?

kbseah commented 3 years ago

To check if I understand correctly, was it the unassembled reads from your data that you tried to classify with SILVA ACT, or the reference sequences that they mapped to? Or both?

If there is a full-length sequence assembled with phyloFlash, and this has BLAST hits to more than one reference genome sequence, I'd be inclined to say that it's probably not a mapping artefact. From what you said above, you were not expecting to find Streptococcus in the sample, is that right? Have you tried to assemble the metagenome, too?

SINA uses a last-common-ancestor method to classify sequences, based on the taxonomy of the best hits. If there is a misannotation of a reference sequence in the database, it's possible that the classification gets thrown off as a result. There is some curation and filtering of what goes into the SILVA database, but sometimes things fall through the cracks.
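
To illustrate why a single misannotated reference can matter, here is a minimal sketch of a strict last-common-ancestor over taxonomy strings, using the four taxonomy paths quoted earlier in this thread. This is a simplification for illustration only: SINA's actual procedure may differ (for instance in how many of the best hits must agree), and the function name is made up.

```python
def last_common_ancestor(taxonomies, sep=";"):
    """Return the longest taxonomy prefix shared by all input paths."""
    paths = [t.split(sep) for t in taxonomies]
    if not paths:
        return ""
    common = []
    # Walk rank by rank and stop at the first disagreement.
    for ranks in zip(*paths):
        if all(r == ranks[0] for r in ranks):
            common.append(ranks[0])
        else:
            break
    return sep.join(common)


consistent_hits = [
    "Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae",
    "Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae",
    "Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pyogenes",
]
# Same hits plus the entry whose path runs through Bacillales/Bacillus.
with_odd_entry = consistent_hits + [
    "Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus pneumoniae",
]

print(last_common_ancestor(consistent_hits))
print(last_common_ancestor(with_odd_entry))
```

With the three consistent paths the result resolves to genus level; adding the one inconsistent path pulls it back to class level, which is the kind of effect a misannotation can have on an LCA-style classification.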

HRGV / phyloFlash

bbmap -id #151