FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0

How is an error defined? #133

Closed CharlotteAnne closed 2 years ago

CharlotteAnne commented 2 years ago

Hi! I have a very basic question - how are you defining a sequencing error? Do you mean an N nucleotide in the sequence? Thank you for your help!

FelixKrueger commented 2 years ago

I suppose very generally, a sequencing error is defined as a base that is called incorrectly, i.e. the called base differs from the true base at that position. It may be a different base entirely (if, say, the fluorescent signal for a different base was stronger than the 'correct' one), or it may indeed come back as N.

In terms of adapter trimming, an error is simply defined as a non-matching base. The default tolerated error rate is 0.1, so up to 10% of the adapter sequence may be different in the actual read. Here is an example:

If we use the default Illumina adapter sequence used by Trim Galore, AGATCGGAAGAGC, the length of the sequence is 13 bp, so a 10% error rate would allow 1 mismatch in that sequence (1.3, rounded down).
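As a rough sketch of that arithmetic (the helper name `max_mismatches` is made up for illustration and is not part of Trim Galore or Cutadapt):

```python
import math

def max_mismatches(adapter: str, error_rate: float = 0.1) -> int:
    """Number of whole mismatches tolerated: length * error_rate, rounded down.

    Hypothetical helper for illustration only; it just mirrors the
    calculation described in the text above.
    """
    return math.floor(len(adapter) * error_rate)

adapter = "AGATCGGAAGAGC"   # default Illumina adapter used by Trim Galore
print(len(adapter))          # length of the adapter in bp
print(max_mismatches(adapter))
```

With the 13 bp adapter and the default rate of 0.1, this gives 13 x 0.1 = 1.3, which rounds down to a single tolerated mismatch.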

So if you had a sequence like this:

GATCGTATAGCTAGCATAGCTAGC**AGATCGGAAGAGC**
GATCGTATAGCTAGCATAGCTAGC**AGGTCGGAAGAGC**

both would get trimmed to:

GATCGTATAGCTAGCATAGCTAGC

If there was an additional mismatch in the adapter sequence, like so:

GATCGTATAGCTAGCATAGCTAGC**GGGTCGGAAGAGC**

the sequence would not be trimmed at all (the sequence in bold now has 2 mismatches to the adapter sequence, exceeding the 0.1 error rate). Makes sense?
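The three examples above can be reproduced with a deliberately simplified sketch. It only counts mismatches against the full-length adapter at each position in the read; the function names are hypothetical, and Cutadapt's actual alignment is more involved (it also handles partial adapter matches at the end of a read as well as insertions and deletions), so please treat this as an illustration of the error-rate idea rather than of how the trimming is really implemented.

```python
import math

ADAPTER = "AGATCGGAAGAGC"  # default Illumina adapter used by Trim Galore

def count_mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def trim_adapter(read: str, adapter: str = ADAPTER, error_rate: float = 0.1) -> str:
    """Simplified, mismatch-only adapter trimming (illustrative assumption).

    Slides the full-length adapter along the read and cuts at the first
    position where the number of mismatches is within the tolerated limit.
    """
    allowed = math.floor(len(adapter) * error_rate)
    for start in range(len(read) - len(adapter) + 1):
        window = read[start:start + len(adapter)]
        if count_mismatches(window, adapter) <= allowed:
            return read[:start]   # drop the adapter and everything after it
    return read                   # no acceptable match: leave the read untouched

insert = "GATCGTATAGCTAGCATAGCTAGC"
perfect = insert + "AGATCGGAAGAGC"         # 0 mismatches
one_mismatch = insert + "AGGTCGGAAGAGC"    # 1 mismatch
two_mismatches = insert + "GGGTCGGAAGAGC"  # 2 mismatches

for read in (perfect, one_mismatch, two_mismatches):
    print(trim_adapter(read))
```

The first two reads are cut back to the 24 bp insert, while the third is printed unchanged because its 2 mismatches exceed the single mismatch allowed for a 13 bp adapter.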

CharlotteAnne commented 2 years ago

Marvellous, makes perfect sense! Thank you very much.