idea: low complexity filter should depend on gc content

OpenGene / fastp

An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)

MIT License

idea: low complexity filter should depend on gc content #76

Open tseemann opened 5 years ago

tseemann commented 5 years ago

The low complexity filter has a fixed default of 30%, which is probably OK for genomes with 50% GC. But if my bacterial genome has, say, 78% GC, then statistically 30% might throw away too much. Might be cool to adapt it to the GC content?
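To put rough numbers on this, here is a minimal sketch. It assumes the complexity definition given in the fastp README (the percentage of bases that differ from the next base, base[i] != base[i+1]); the `len - 1` denominator, the independent-base model, and the `adjusted_threshold` helper are illustrative assumptions, not anything fastp implements.

```python
def complexity(seq: str) -> float:
    """Fraction of positions whose base differs from the next base."""
    if len(seq) < 2:
        return 0.0
    diffs = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    return diffs / (len(seq) - 1)


def expected_complexity(gc: float) -> float:
    """Expected complexity of a random sequence with GC fraction `gc`.

    Assumes independent bases with P(G) = P(C) = gc / 2 and
    P(A) = P(T) = (1 - gc) / 2, so P(adjacent bases differ) = 1 - sum(p_i^2).
    """
    p_gc = gc / 2
    p_at = (1 - gc) / 2
    return 1 - 2 * (p_gc ** 2 + p_at ** 2)


def adjusted_threshold(gc: float, base_threshold: float = 0.30) -> float:
    """Hypothetical rule: scale the threshold by the expected complexity
    relative to a 50% GC genome."""
    return base_threshold * expected_complexity(gc) / expected_complexity(0.5)


print(expected_complexity(0.50))  # about 0.75
print(expected_complexity(0.78))  # about 0.67
print(adjusted_threshold(0.78))   # about 0.27
```

Under this simple model the correction is modest: a 78% GC genome would move the 30% default to roughly 27%. Real genomes are not sequences of independent bases, so this is only a starting point.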

royfrancis commented 5 years ago

There should also be a separate GC content filter, similar to what prinseq does.

oschwengers commented 5 years ago

@royfrancis's idea of a prinseq-like approach would be very interesting! Given a normal distribution of read GC contents, one could throw away reads more than 3 standard deviations below or above the global mean GC.

Another very simple approach would be to throw away all reads which have an 'insanely' low/high GC... something like < 15 % or > 85 %. Might be interesting to have a systematic view on let's say all RefSeq genomes to get a proper guideline....
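Both approaches can be sketched in a few lines. This is an illustration only: the function names are hypothetical, reads are plain strings, and the whole read set is held in memory, which a real preprocessor would avoid.

```python
from statistics import mean, pstdev


def gc_fraction(seq: str) -> float:
    """GC fraction over called bases (A/C/G/T); N and other symbols are ignored."""
    seq = seq.upper()
    called = sum(seq.count(b) for b in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def filter_by_sd(reads, n_sd=3.0):
    """Keep reads whose GC lies within n_sd standard deviations of the mean GC."""
    gcs = [gc_fraction(r) for r in reads]
    if not gcs:
        return []
    mu = mean(gcs)
    sd = pstdev(gcs)  # population SD over all reads
    return [r for r, g in zip(reads, gcs) if abs(g - mu) <= n_sd * sd]


def filter_by_fixed_cutoff(reads, low=0.15, high=0.85):
    """Keep reads whose GC fraction lies in [low, high]."""
    return [r for r in reads if low <= gc_fraction(r) <= high]
```

The standard deviation variant needs two passes over the data (one to get the mean and SD, one to filter), while the fixed cut-off can be applied read by read in a single pass.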

royfrancis commented 5 years ago

Computing a GC cut-off is going to be tricky as it will depend heavily on the dataset (organism, tissue, highly expressed transcripts, mitochondrial reads etc). It might be easier to just leave it to the user to set a cut-off.

tseemann commented 5 years ago

This is very risky for bacteria, which often contain mobile genetic elements with GC quite different from the rest of the chromosome. Some key genes, like the Shiga toxins in E. coli, have extreme GC (>85%) in the middle of the gene and often have poor sequencing coverage already.

oschwengers commented 5 years ago

Hmm... OK, then it's certainly a bad idea. I once implemented a post-assembly contig filter that included this criterion, but that was meant for contigs, which is a different story.

Still, it would be interesting to have a systematic view of GC min/max values over a huge range of bacterial genomes with a sliding window, e.g. 150 nt. I'll put it on my list :-)
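A rough sketch of what such a scan could look like for a single sequence. The function name is hypothetical, the window slides one base at a time, and ambiguous bases such as N are simply counted as non-GC, which is a simplification.

```python
def windowed_gc_range(genome: str, window: int = 150):
    """Return (min_gc, max_gc) over all windows of `window` bases, sliding by 1."""
    genome = genome.upper()
    if len(genome) < window:
        raise ValueError("sequence is shorter than the window")
    is_gc = [1 if base in "GC" else 0 for base in genome]
    count = sum(is_gc[:window])  # GC count of the first window
    lo = hi = count
    for start in range(1, len(genome) - window + 1):
        # Slide by one base: add the base entering, drop the base leaving.
        count += is_gc[start + window - 1] - is_gc[start - 1]
        lo = min(lo, count)
        hi = max(hi, count)
    return lo / window, hi / window
```

Running this per replicon across a genome collection and collecting the (min, max) pairs would give the kind of overview described above; the rolling count keeps the scan linear in genome length.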

tseemann commented 5 years ago

Mobile elements make everything more difficult.