FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0
461 stars 150 forks source link

Adapter trimming in RRBS #132

Open franrodalg opened 2 years ago

franrodalg commented 2 years ago

Hi Felix,

We've recently analysed a paired-end 150 bp RRBS dataset, which ended having insert sizes well below 100 bp. This really concerned us, because insert size is having a huge impact on the differential alignment efficiencies between regions of the genome we need to compare. Tracing back factors potentially impacting insert size, we realised adapter trimming was having a huge impact (this is from a different dataset than our original source of concern, but the trend is very similar):

image

Until now, I had never realised the large portion of sequenced bases that TrimGalore could be removing in this stage. I had blindly trusted that it would only be the exact sequence of the adapter that gets removed ("AGATCGGAAGAGC" by default, so less than 10% of each mate if the entire adapter gets sequenced), but for these libraries we often see 30-40% of trimmed bases, and even higher in some cases.

Are those normal ranges?

Checking the RRBS guide it says:

To control the stringency of the adapter removal process, one gets to specify the minimum number of required overlap with the adapter sequence, else it will default to 1. This default setting is extremely stringent, i.e., an overlap with the adapter sequence of even a single bp is spotted and removed.

We have been blindly using the default settings, but should we tweak them? If so, is there any guide on how much overlap should we be requesting?

The quote from the guide also has be quite confused, though. What does "a single bp" entail here? Because any nucleotide (A, C, G, or T) appearing on any read would theoretically overlap with some base within the adapter sequence, right? Wouldn't that mean that every single base on every single read would need to be removed? Any clarification would really be appreciated.

Cheers, Fran

franrodalg commented 2 years ago

Asking my experimental colleagues, I just realised that the 13 bp indicated in the text are (as the document explicitly mentions) only the start of the adapter sequences, which are 33 bp long. This would make the maximum length to remove around 22% of the reads, right?

FelixKrueger commented 2 years ago

Hi @franrodalg

I have tried to make a slide explaining this here (slide 27).

Spelled out it would translate to:

So in short, I think it is entirely normal to get very short insert sizes in RRBS (compare the first figure in the RRBS guide), so 150bp paired-end reads will not only be trimmed by a substantial amount but also Read 2 will be completely redundant as it covers the exact same PCR amplified fragment from the other end (which is disregarded and discarded during the overlap detection in the methylation extraction).

I would not change the stringency fo the adapter detection for RRBS, as it is essential to identify adapter contamination so that you can trim a further 2bp for the option --rrbs (which removes artifactual methylation states in RRBS data). Even if the trimming appears overly harsh, you will often only lose 1-3 bp that are more likely genomic bases rather than the beginning of adapter contamination, but I think it is worth keeping this as is (i.e. the stringent defaults) to avoid including spurious methylation calls.

franrodalg commented 2 years ago

Thanks so much, Felix. It is much clearer now.

Nevertheless, as I mentioned before, another recent sequencing run of ours shows median insert sizes even lower than any other RRBS run we've ever seen (for 150bp paired-end, that is)

image

The entire adapter should be 33 bp, I think which shouldn't account for more than half of the bases in the reads (even if they completely overlap), right?

This is the first time we use a Zymo-Seq RRBS prep kit, and we have used a non-directional alignment mode as suggested, but we wonder if there's anything else we could do to avoid losing so much yield?

Cheers, Fran

franrodalg commented 2 years ago

Hi @FelixKrueger ,

I have done some further QCs, and it seems to me that this issue is more generally concerning than I first anticipated. I thought this would only affect us since I doubt many people will need to check relative read counts in different regions from RRBS data. But I have now realised that the insert size (which appears to be largely, if not completely, driven by the adapter trimming step), associates extremely strongly with global CpG methylation:

image

Methylation here is expressed as proportion, not % (so 0.5 is 50%, not 0.5%). This is from the libraries in the original post (created with a NEB kit), but we see something very similar on the Zymo libraries.

Any insight or advice on how to proceed would be really appreciated.

Cheers, Fran

FelixKrueger commented 2 years ago

Hi Fran,

apologies for the late reply, but Easter egg hunts got in the way...

I don't think I quite agree with the statement that "fragment size is [largely] driven by the adapter trimming step". Rather, the steps you take during sample and library preparation (treatment, clean-up, size selection, etc.), combined with the bridge amplification on the flow cell (which favours shorter fragments) should guide the fragment size distribution of your sequencing results. The amount of trimmed adapter bases is thus a secondary effect, but not the cause. Taking a 2x150bp library as an example, if your insert fragment is 70bp long, you will run sequence into the adapter sequence at that point and remove 80bp from R1 as well as R2 (on top of that, the left-over 70bp will be fully covered by both R1 and R2).

In the case of RRBS, the size selection and bridge amplification are even further exacerbated by the fact that MspI digestions chops the genome into very short fragments, with the vast majority being <150bp in total (see the theoretical sizes produced in the RRBS guide).

Regarding the 'biological' effect that shorter fragments display lower levels of methylation: I am not very surprised by this finding as you would expect most MspI sites in the genome (CCGG) in very CG-rich regions of the genome. And these fairly short, CG-rich regions of the genomes are often associated with CpG islands or promoters, which are typically found consistently unmethylated. I didn't completely understand the graph you were showing (was each dot in the graph a different library?), but would that explain what you are seeing?

franrodalg commented 2 years ago

Thanks, Felix.

I guess being purely computational myself, I find it hard to wrap my head around the idea of insert sizes being truly so small. I assumed size selection would only keep fragments of proper length, but I guess I was spoiled with the higher quality libraries I had analysed in the past. Would you recommend a much more stringent size threshold in TrimGalore (say, 100bp reads or even higher) to balance the libraries? I know this will likely reduce the yield substantially, but it might be worth the sacrifice.

Regarding your interpretation of the methylation vs insert size correlation, we had guessed that was the case as well. But isn't it concerning? I suppose your idea is that people should only compare positional methylation levels, but I think many studies rely on overall levels (as reported by Bismark) to infer differences between conditions.

(And yes, each point in every figure I showed above corresponds with a single library)

Cheers, Fran

FelixKrueger commented 2 years ago

I personally would not filter out anything from the start (i.e. at the trimming step). If you wanted to test the effect of different fragment lengths you could always do this as a post-processing step on the BAM files, where you could also choose different cut-offs if you so desired.

The rationale for not removing shorter fragments from the libraries overall would probably be that the sampling of fragments in RRBS is not a random draw - the fragment sizes, and thus what you can get out of a library, are governed by the presence of MspI sites across the genome. So even if the average methylation values might show some overall differences I would expect that the same positions should behave very similarly between different samples. So while the fragment length might indeed have an impact on the methylation levels, all samples should 'suffer' from this phenomenon equally. This might indeed start to fall apart if you start comparing different methods and fragment sizes with each other - so one should try to keep as many factors the same as possible.

You as long as similar regions are covered at comparable levels I don't think I would apply any additional filtering. You just need to keep in mind that some regions behave very differently depending largely on sequence (GC) content.

franrodalg commented 2 years ago

Just to clarify, all libraries in each figure have been generated using exactly the same procedure, and yet their (median) insert sizes are somewhat different. We are not comparing different methods and (intentional) fragment sizes.

FelixKrueger commented 2 years ago

I didn't think you were, that would be really bad :)

But yes, methylation data is extremely bi-modal, with >> 90% of data either being highly methylated (80-100%) or not methylated at all (0-20%). So looking at the median methylation values for the entire samples will represent a slight shift in the representation of highly and lowly methylated fragments. My point was that you should probably not dwell on looking at these averages, as the data will still look quite comparable at the local level between samples, even though you might have a slight shift in the depth of coverage. Does that make sense?

franrodalg commented 2 years ago

Now that we know of this potential impact, we will make sure to not rely on the overall values. But perhaps it's worth stating it clearly in the documentation? I've seen many studies that blindly rely on overall CpG methylation as reported by Bismark as their estimate of global methylation levels.

Anyway, thanks so much for all this, it's been really illuminating

FelixKrueger commented 2 years ago

Agreed, reporting single average value for something like methylation data is - interesting. Histograms or violin/beanplots are indeed much more informative, see e.g. here [slide 29]: https://www.bioinformatics.babraham.ac.uk/training/Methylation_Course/Visualising%20and%20Exploring%20Lecture.pdf

franrodalg commented 2 years ago

I assume those violin plots (and potential statistical comparisons between them) should only be done on common positions, then? In other words, if you plotted every single methylation value obtained on each sample/condition, you would still suffer from the potential effect of differential insert sizes when comparing means, right?

FelixKrueger commented 2 years ago

Working with common positions that were covered well in all samples would be ideal, and get you around looking at coverage issues such as in slide 22 in the above slide deck. Looking only at common positions works well if you compare a small number of samples, but this greatly reduces the number of usable positions when you have many samples.

What we do to get mitigate this is to aggreate over larger regions (e.g. windows of ~75-100 CpGs covered in all samples), where you may or may not apply a depth threshold (this is often something you can't really afford, depending on sequencing depth), and where you also require a certain minimum number of positions covered (typically at least 20 different CpGs within a 100 CpG window). This will at least make sure that we are only looking at regions that had a decent amount of coverage in all samples, and this procedure is not quite as harsh as looking at positions in common only (SeqMonk has a tick box for that).