Closed user-tq closed 1 year ago
Off the top of my head I am not exactly which sequence here would occur in Read 1 and which one in Read 2, but if in doubt you could do test out both options, and you should be able to see a steep drop in the trimming histogram after base 13. Also, as the {insert} is getting A-tailed on the 3' end, both sequences need to start with an A
, as well as the universal start a AGATCGGAAGAGC
, which would make the command line either:
trim_galore -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -a2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
--fastqc --paired ./test_raw_1.fq.gz ./test_raw_2.fq.gz -o trim_set
or
trim_galore -a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -a2 AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --fastqc --paired ./test_raw_1.fq.gz ./test_raw_2.fq.gz -o trim_set
I remember having done some test with the universal Illumina adapter sequence (AGATCGGAAGAGC
) at some point. It is true that there are a number of locations in the genome where this sequence occurs, but I had forgotten the exact details. So I quickly had another look. Here is an example for the human genome:
bowtie2 -a --score-min L,0,0 -c AGATCGGAAGAGC -x /references/Homo_sapiens.GRCh38
0 0 15 38727103 0 13M * 0 0 AGATCGGAAGAGC IIIIIIIIIIIII AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:13 YT:Z:UU
0 272 4 42732697 255 13M * 0 0 GCTCTTCCGATCT IIIIIIIIIIIII AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:13 YT:Z:UU
0 256 3 48254550 255 13M * 0 0 AGATCGGAAGAGC IIIIIIIIIIIII AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:13 YT:Z:UU
0 272 1 221146506 255 13M * 0 0 GCTCTTCCGATCT IIIIIIIIIIIII AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:13 YT:Z:UU
0 256 11 114779487 255 13M * 0 0 AGATCGGAAGAGC IIIIIIIIIIIII AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:13 YT:Z:UU
0 272 6 4493956 255 13M * 0 0 GCTCTTCCGATCT IIIIIIIIIIIII AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:13 YT:Z:UU
0 272 5 149867216 255 13M * 0 0 GCTCTTCCGATCT IIIIIIIIIIIII AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:13 YT:Z:UU
0 256 5 64690510 255 13M * 0 0 AGATCGGAAGAGC IIIIIIIIIIIII AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:13 YT:Z:UU
0 256 13 56182421 255 13M * 0 0 AGATCGGAAGAGC IIIIIIIIIIIII AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:13 YT:Z:UU
0 256 19 30548315 255 13M * 0 0 AGATCGGAAGAGC IIIIIIIIIIIII AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:13 YT:Z:UU
0 272 12 93854159 255 13M * 0 0 GCTCTTCCGATCT IIIIIIIIIIIII AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:13 YT:Z:UU
0 272 6 92574416 255 13M * 0 0 GCTCTTCCGATCT IIIIIIIIIIIII AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:13 YT:Z:UU
So the answer is: 12. There are 12 positions in the human genome where this sequence occurs, and where would indeed have to expect your to the clipped for any type of 'normal' DNA/RNA-sequencing. In bisulfite sequencing (for which Trim Galore had initially been written), the adapter sequence does not occur ever (due to the C to T conversion), so this would certainly be a non-issue for those kind of experiments.
So I suppose ultimately it boils down to a judgement call. While it is true that these 13 bp would be trimmed in a default Trim Galore run with adapter auto-detection, it doesn't necessarily mean that you do not get any alignments to those regions at all, but rather that reads to these loci might end up somewhat shorter than you would expect without trimming. If these happen to be the 12 regions you are studying in your project, however, I would agree that some more diligence is due.
Generally speaking, I would argue that unduly shortening some reads for a handful of positions (say 12) in the genome is very well justified for the majority of cases, where the impact will probably never manifest in anything measurable. We had numerous occasions where we had been assured that the adapters used were Nextera (when they were in fact Illumina standard adapters), or vice versa. This was the very reason of implementing the adapter auto-detection, which saved as an unknown number of re-processing, or even worse, errors stemming from trimming an incorrect adapter sequence.
The fact that you are looking at a manufacturer's documentation (and a note saying don't use the 13bp defaults) and come up with a command line that is almost certainly incorrect in 1 or potentially 2 places (see above) will almost certainly have a bigger effect than using the default auto-detection. Again with the exception that if the positions that share sequence with the adapter sequence are your research focus, you should probably fine-tune.
I know it is a compromise, but one where I think the benefits by far outweigh any potential detrimental effects. Happy to discuss.
Haven't heard back, but it might be good to have this documented somewhere in any case.
I am not proficient in removing adapters, as I have always relied on TrimGalore's automatic recognition before. I only became aware of this when I recently communicated with laboratory personnel. My kit manual is that
5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT{insert}GATCGGAAGAGCACACGTCTGAACTCCAGTCAC{8bp index}ATCTCGTATGCCGTCTTCTGCTTG-3'
and cutadapt noteyou should avoid doing so as that sequence AGATCGGAAGAGC occurs multiple times in the human genome.
so I plan to modify my CMD to
Could you give me some advice?