FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0
472 stars 151 forks source link

Is using an automatically recognized 13bp adapter sufficient? #165

Closed user-tq closed 1 year ago

user-tq commented 1 year ago

I am not proficient in removing adapters, as I have always relied on TrimGalore's automatic recognition before. I only became aware of this when I recently communicated with laboratory personnel. My kit manual is that 5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT{insert}GATCGGAAGAGCACACGTCTGAACTCCAGTCAC{8bp index}ATCTCGTATGCCGTCTTCTGCTTG-3' and cutadapt note you should avoid doing so as that sequence AGATCGGAAGAGC occurs multiple times in the human genome.

so I plan to modify my CMD to

trim_galore -q 20 --length 20   -a GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
  --fastqc    --paired ./test_raw_1.fq.gz   ./test_raw_2.fq.gz  -o trim_set

Could you give me some advice?

FelixKrueger commented 1 year ago

Off the top of my head I am not exactly which sequence here would occur in Read 1 and which one in Read 2, but if in doubt you could do test out both options, and you should be able to see a steep drop in the trimming histogram after base 13. Also, as the {insert} is getting A-tailed on the 3' end, both sequences need to start with an A, as well as the universal start a AGATCGGAAGAGC, which would make the command line either:

trim_galore -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -a2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
  --fastqc    --paired ./test_raw_1.fq.gz   ./test_raw_2.fq.gz  -o trim_set

or

trim_galore -a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -a2 AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC   --fastqc    --paired ./test_raw_1.fq.gz   ./test_raw_2.fq.gz  -o trim_set

I remember having done some test with the universal Illumina adapter sequence (AGATCGGAAGAGC) at some point. It is true that there are a number of locations in the genome where this sequence occurs, but I had forgotten the exact details. So I quickly had another look. Here is an example for the human genome:

bowtie2 -a --score-min L,0,0 -c AGATCGGAAGAGC -x /references/Homo_sapiens.GRCh38 
0       0       15      38727103        0       13M     *       0       0       AGATCGGAAGAGC   IIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:13 YT:Z:UU
0       272     4       42732697        255     13M     *       0       0       GCTCTTCCGATCT   IIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:13 YT:Z:UU
0       256     3       48254550        255     13M     *       0       0       AGATCGGAAGAGC   IIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:13 YT:Z:UU
0       272     1       221146506       255     13M     *       0       0       GCTCTTCCGATCT   IIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:13 YT:Z:UU
0       256     11      114779487       255     13M     *       0       0       AGATCGGAAGAGC   IIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:13 YT:Z:UU
0       272     6       4493956 255     13M     *       0       0       GCTCTTCCGATCT   IIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:13 YT:Z:UU
0       272     5       149867216       255     13M     *       0       0       GCTCTTCCGATCT   IIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:13 YT:Z:UU
0       256     5       64690510        255     13M     *       0       0       AGATCGGAAGAGC   IIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:13 YT:Z:UU
0       256     13      56182421        255     13M     *       0       0       AGATCGGAAGAGC   IIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:13 YT:Z:UU
0       256     19      30548315        255     13M     *       0       0       AGATCGGAAGAGC   IIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:13 YT:Z:UU
0       272     12      93854159        255     13M     *       0       0       GCTCTTCCGATCT   IIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:13 YT:Z:UU
0       272     6       92574416        255     13M     *       0       0       GCTCTTCCGATCT   IIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:13 YT:Z:UU

So the answer is: 12. There are 12 positions in the human genome where this sequence occurs, and where would indeed have to expect your to the clipped for any type of 'normal' DNA/RNA-sequencing. In bisulfite sequencing (for which Trim Galore had initially been written), the adapter sequence does not occur ever (due to the C to T conversion), so this would certainly be a non-issue for those kind of experiments.

So I suppose ultimately it boils down to a judgement call. While it is true that these 13 bp would be trimmed in a default Trim Galore run with adapter auto-detection, it doesn't necessarily mean that you do not get any alignments to those regions at all, but rather that reads to these loci might end up somewhat shorter than you would expect without trimming. If these happen to be the 12 regions you are studying in your project, however, I would agree that some more diligence is due.

Generally speaking, I would argue that unduly shortening some reads for a handful of positions (say 12) in the genome is very well justified for the majority of cases, where the impact will probably never manifest in anything measurable. We had numerous occasions where we had been assured that the adapters used were Nextera (when they were in fact Illumina standard adapters), or vice versa. This was the very reason of implementing the adapter auto-detection, which saved as an unknown number of re-processing, or even worse, errors stemming from trimming an incorrect adapter sequence.

The fact that you are looking at a manufacturer's documentation (and a note saying don't use the 13bp defaults) and come up with a command line that is almost certainly incorrect in 1 or potentially 2 places (see above) will almost certainly have a bigger effect than using the default auto-detection. Again with the exception that if the positions that share sequence with the adapter sequence are your research focus, you should probably fine-tune.

I know it is a compromise, but one where I think the benefits by far outweigh any potential detrimental effects. Happy to discuss.

FelixKrueger commented 1 year ago

Haven't heard back, but it might be good to have this documented somewhere in any case.