FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0
461 stars 150 forks source link

Adapter trimming of Illumina DNA PCR-Free prep sequences #120

Closed andrews-matt closed 3 years ago

andrews-matt commented 3 years ago

I have used Trim Galore v0.6.5 to trim the adapter sequences from some paired-end WGS data that was prepared using Illumina's DNA PCR-Free prep, which uses adapter sequences CTGTCTCTTATACACATCT+ATGTGTATAAGAGACA. Although your documentation does not specify the second sequence, looking at the clipped reads it does appear that Trim Galore was able to automatically detect and trim the reverse complement (TGTCTCTTATACACAT) from the reads in the R2 file.

I understand that the first sequence was detected from the 12bp as being identical to the Nextera adapters so that makes sense, but I seek to understand how Trim Galore was able to detect the reverse complement of the second sequence considering that Illumina only released the PCR-Free prep in the past 12 months and I am using an older version of Trim Galore. Mainly I want to understand whether this is expected behavior or whether I should manually specify the adapter sequence just to be safe.

Thank you.

FelixKrueger commented 3 years ago

Hi Matt,

I think the short answer is: Trim Galore isn't explicitly aware of this new PCR-free kit, and if it did the right thing there chances are it was by pure luck. If the Nextera sequence is detected, the same sequence is being used for both R1 and R2, here CTGTCTCTTATA.

My gut feeling tells me that your R1 sequence is a perfect match to the Nextera sequence, so that is fine:

CTGTCTCTTATA
CTGTCTCTTATACACATCT

Your R2 sequence

CTGTCTCTTATA        (Nextera)
ATGTGTATAAGAGACA    (adapter sequence you specified)
 .TGTCTCTTATACACAT   (reverse complement of adapter sequence you specified)

This would mean that the R2 adapter matches the Nextera sequence in 11 out of 12 positions, and if any of these last 11 positions were found at the 3' end of reads, it would trim the correct sequence.

If the sequence was found earlier in the read (not right at the very end), then the first C in the Nextera adapter sequence would also have to match for the adapter to be trimmed as contamination as a full match. There are now options:

1) If we trust that the sequence you gave above is correct and complete, the C used by specifying the Nextera adapter sequence (which occurred automatically) is not from the adapter, that C is matched to a genomic base at that position. If we assume that C occurs in ~20% of cases you would get a 'perfect adapter match' in ~20% of (c|b)ases, but in 80% of cases there would be a mismatch (T, A or G). As the default error profile of Trim Galore is 0.1, it would allow 10% of bases in the adapter sequence to mismatch, which in the case of a 12bp adapter sequence, such as the Nextera sequence, would mean 1 mismatch is tolerated. In other words: Even though the sequence used isn't exactly correct, it just luckily does the right thing anyway.

2) Is there a possibility that the sequence you specified above is short by one base pair? If that were true, and that base would be a G, then it would be a perfect reverse complement of the Nextera adapter sequence:

CTGTCTCTTATACACATCT+ATGTGTATAAGAGACA G

What also often happens is that the adapter sequences are given relative to the top strand of the DNA, which isn't necessarily the way sequences get read when they come off the sequencer.

If we would further assume that the sequence you gave are the R1 + R2 adapters, but in fact the R1 adapter occurs at the 3' end of R1, and the R2 adapter at the 5' end (which would match up with the illustration on slide 4 of Illumina's tagmentation, we could end up with a situation that would like this (reference to the top strand, I means insert base for top strand, and X is the reverse complement to I on the bottom strand):

5' ATGTGTATAAGAGACA G IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII CTGTCTCTTATACACATCT 3'
3' TACACATATTCTCTGT C XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX GACAGAGAATATGTGTAGA 5' 

This would perfectly fit the following model:

I realise that there are quite a few if's in this assumption, but it would make so much sense... Even if the additional G that I invented to fit my flow of words better was not really present, it would most likely still do the right thing as outlined in option 1) (maybe this is even true for any Nextera based tagmentation trimming)?

I hope this is more helpful than it is confusing, do let me know if you've got any questions.

andrews-matt commented 3 years ago

Hi Felix,

My apologies for the slow reply but thank you so much for the comprehensive answer. This is really an excellent response and I think you've hit the nail on the head with your first explanation.

The sequence I specified is correct according to the Illumina documentation (https://support-docs.illumina.com/SHARE/AdapterSeq/Content/SHARE/AdapterSeq/Nextera/SequencesNXTILMPrepAndPCR.htm) and when I compare my raw R2 files against the trimmed R2 files I can see that the reads had an additional C trimmed just before TGTCTCTTATACACATCT on the 3' end.

Pre-trim sequence: 5' ... TGAGCCCTGTCTCTTATACACATCT ... 3' Post-trim sequence: 5' ... TGAGCC 3'

So it seems like Trim Galore does just identify the sequence as belonging to the Nextera adapter by chance.

Thank you once again for your help!

FelixKrueger commented 3 years ago

Hi Matt,

Thanks for the confirmation. So in the end it does indeed seem like it is accidentally doing the right thing (even though it takes this one mismatch in the adapter sequence to work). As long it is doing the right thing, maybe we don't need to dig deeper? You could of course also try to specify the adapters manually, and see it it makes any difference, using:

-a CTGTCTCTTATA -a2 ATGTGTATAAGAGACA

I'll close go ahead and close the issue for the moment, if you have any further thoughts just let me know.