Open tseemann opened 5 years ago
There should also be a separate GC content filter. Similar to what prinseq does.
@royfrancis's idea of a prinseq-like approach would be very interesting! Assuming the read GC contents are roughly normally distributed, one could throw away reads whose GC is more than 3 standard deviations below or above the global mean GC.
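To make the mean +/- 3 SD idea concrete, here is a minimal Python sketch. It is not how prinseq or any existing tool implements it; the function names `gc_fraction` and `filter_by_gc_sd` and the `n_sd` parameter are made up for illustration, and it assumes all reads fit in memory as plain strings.

```python
from statistics import mean, pstdev


def gc_fraction(seq):
    """Fraction of G/C among called bases (A, C, G, T); N and others are ignored."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def filter_by_gc_sd(reads, n_sd=3.0):
    """Keep reads whose GC lies within n_sd standard deviations of the mean GC.

    Hypothetical helper: the mean and SD are computed over the given reads,
    so the cut-offs depend entirely on the dataset.
    """
    if not reads:
        return []
    gcs = [gc_fraction(read) for read in reads]
    mu = mean(gcs)
    sd = pstdev(gcs)  # population SD over all reads
    low, high = mu - n_sd * sd, mu + n_sd * sd
    return [read for read, gc in zip(reads, gcs) if low <= gc <= high]


if __name__ == "__main__":
    reads = ["ACGT" * 5] * 20 + ["G" * 20]
    kept = filter_by_gc_sd(reads)
    print(len(reads), "reads in,", len(kept), "reads kept")
```

Note that this needs two passes over the data (one to get mean and SD, one to filter), which matters for a streaming tool.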
Another very simple approach would be to throw away all reads which have an 'insanely' low/high GC... something like < 15 % or > 85 %. Might be interesting to have a systematic view on let's say all RefSeq genomes to get a proper guideline....
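The fixed cut-off variant is even simpler. A rough Python sketch, with the 15 % / 85 % values from the comment above as defaults; `passes_gc_cutoff`, `min_gc` and `max_gc` are hypothetical names, not options of any existing tool:

```python
def passes_gc_cutoff(seq, min_gc=0.15, max_gc=0.85):
    """Return True if the read's GC fraction lies within [min_gc, max_gc].

    The defaults are the illustrative 15 % / 85 % values, not validated
    guidelines. Bases other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return False  # nothing to measure, treat as failing
    gc = (seq.count("G") + seq.count("C")) / called
    return min_gc <= gc <= max_gc


if __name__ == "__main__":
    for read in ["ACGTACGTAC", "GGGGGGGGGC", "ATATATATAT"]:
        print(read, passes_gc_cutoff(read))
```

Since the thresholds are plain parameters, a user could set them per dataset.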
Computing a GC cut-off is going to be tricky as it will depend heavily on the dataset (organism, tissue, highly expressed transcripts, mitochondrial reads etc). It might be easier to just leave it to the user to set a cut-off.
This is very risky for bacteria, which often contain mobile genetic elements with GC quite different from the rest of the chromosome. Some key genes, like the Shiga toxins in E. coli, are extreme GC (>85%) in the middle of the gene and often have poor sequencing coverage already.
hmm... ok then it's certainly a bad idea.. I once implemented an after-assembly contig filter that included this criterion, but that was meant for contigs, which is a different story.
Still, it would be interesting to have a systematic view on GC min/max values over a huge range of bacterial genomes with a sliding window, e.g. 150 nt. I'll put it on my list :-)
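For the sliding-window survey, a minimal Python sketch of the per-genome part could look like this. It assumes the genome is already loaded as a single string; `windowed_gc_range` is a hypothetical name, and ambiguous bases such as N are simply counted as non-GC here.

```python
def windowed_gc_range(genome, window=150):
    """Return (min_gc, max_gc) over all windows of the given length.

    Uses a running count so each base is visited once. Bases other than
    G/C (including N) count as non-GC in this sketch.
    """
    genome = genome.upper()
    n = len(genome)
    if window <= 0 or n < window:
        raise ValueError("sequence must be at least one window long")

    is_gc = [1 if base in "GC" else 0 for base in genome]
    count = sum(is_gc[:window])  # GC count of the first window
    low = high = count
    for start in range(1, n - window + 1):
        # Slide by one: add the base entering, drop the base leaving.
        count += is_gc[start + window - 1] - is_gc[start - 1]
        low = min(low, count)
        high = max(high, count)
    return low / window, high / window


if __name__ == "__main__":
    toy_genome = "AT" * 100 + "GC" * 100 + "AT" * 100
    print(windowed_gc_range(toy_genome, window=150))
```

Running this over every replicon of every genome and collecting the minima and maxima would give the kind of overview described above.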
Mobile elements make everything more difficult.
The low-complexity filter has a fixed default of 30%, which is probably OK for genomes with 50% GC. But if my bacterial genome has, say, 78% GC, then statistically 30% might throw away too much. Might be cool to adapt it to the GC?
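A back-of-the-envelope way to check this, as a Python sketch under two loud assumptions: (1) complexity is measured as the fraction of positions where a base differs from the next base, and (2) bases are independent, with G and C equally likely and A and T equally likely. The helpers `complexity`, `expected_complexity` and `scaled_threshold` are hypothetical, and the scaling rule is only one possible way to adapt the cut-off.

```python
def complexity(seq):
    """Fraction of adjacent base pairs that differ (assumed definition)."""
    if len(seq) < 2:
        return 0.0
    diff = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    return diff / (len(seq) - 1)


def expected_complexity(gc):
    """Expected complexity of a random sequence with the given GC fraction.

    Assumes independent bases with P(G) = P(C) = gc / 2 and
    P(A) = P(T) = (1 - gc) / 2.
    """
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be between 0 and 1")
    p_gc = gc / 2
    p_at = (1 - gc) / 2
    p_same = 2 * p_gc ** 2 + 2 * p_at ** 2  # chance two adjacent bases match
    return 1 - p_same


def scaled_threshold(gc, base_threshold=0.30, reference_gc=0.5):
    """Scale the threshold by expected complexity relative to reference_gc."""
    return base_threshold * expected_complexity(gc) / expected_complexity(reference_gc)


if __name__ == "__main__":
    for gc in (0.5, 0.78):
        print(gc, expected_complexity(gc), scaled_threshold(gc))
```

Under this simplified model the expected complexity is 0.75 at 50% GC and about 0.67 at 78% GC, so the scaled threshold would drop from 30% to roughly 27%. Real genomes are not random sequences, so this is only a rough guide.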