HadrienG / InSilicoSeq

:rocket: A sequencing simulator
https://insilicoseq.readthedocs.io
MIT License
177 stars 32 forks source link

most reads removed in subsequent qc step #139

Closed jmmaritz closed 11 months ago

jmmaritz commented 4 years ago

Hi, I generated a set of simulated reads using the included Hiseq error profile and when I ran them through a typical qc step for a metagenomics study (bbduk, minlen=50, maq=30) over 80% of my reads did not pass the qc filter. This seems odd to me as in most Hiseq runs you would expect the majority of the reads to pass. Any idea what could be causing this?

Thanks, Julia

HadrienG commented 4 years ago

Hi!

The included Hiseq model is based on a slightly worse than average Hiseq run for us here. While the average read read quality is around 35, the average quality for the last base pair drops to 29.

I've never used bbduk but the manual says that maq filters by average quality of the read, so either you got quite unlucky with the variance while generating your dataset, or something is up bbduk average quality calculations.

EDIT: I've generated mean quality histogram both with fastqc and prinseq, as well as with bbduk aqhist option on some hiseq simulated reads. Both fastqc and prinseq correctly report an average quality of 35, while bbduk reports 29. I went ahead and tried fastp to filter by q=30, and only lost 17% of the simulated reads to the filter. I'd say something weird is happening with bbduk.

Hope that helps, Hadrien.