FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0

High PolyA content in R2 reads after trimming. #180

Open AmrSaadeldin opened 9 months ago

AmrSaadeldin commented 9 months ago

Hello

I am working with paired-end human whole-genome bisulfite sequencing data. After using Trim Galore to remove adapters, I encountered an issue with the FastQC report for the R2 reads across all samples, which indicates a high PolyA content. This is unexpected since Trim Galore usually removes PolyA sequences. My primary concern is whether to proceed with mapping, given that the R1 reads are fine and have passed all FastQC tests. It is also worth noting that neither the R1 nor the R2 reads in any sample show overrepresented sequences, and that they meet most other FastQC criteria, except for the adapter content in the R2 reads. I am seeking advice on how to address the PolyA content in the R2 reads and whether it is advisable to move forward with the current data.

Below are the images before and after trimming.

The first image: The diagram shows the adapter content before trimming across the various samples. Each line represents the adapter content for a specific sample. The blue lines indicate the Illumina adapters in the R1 and R2 samples, while the orange line represents the polyA content in one of the R1 samples. The remaining lines, colored red and light blue, correspond to the polyA content in all R2 samples.

The second image: This is the FastQC report depicting adapter content after trimming with Trim Galore. Every line in this report represents polyA content. The orange-red line at the bottom shows the polyA content for one of the R1 samples. All other lines correspond to the polyA content in the R2 samples.

[Images: Screenshot 2023-11-19 at 13 47 28, Screenshot 2023-11-19 at 13 48 07]

FelixKrueger commented 9 months ago

Hi @AmrSaadeldin ,

Thanks for the details. Here is my initial assessment of the situation:

In either case, I don't think the results would be very different.

AmrSaadeldin commented 9 months ago

Hi @FelixKrueger, thank you so much for your help and your detailed observations.

Based on your insights, I am now contemplating whether to conduct another round of trimming using the A{10} parameter. However, I'm concerned that this might introduce bias in the data. Considering this, my inclination is towards proceeding directly with the mapping phase. I suspect that the sequences might either not map uniquely or not map at all, which, as you mentioned, could be due to technical artifacts or genomic stretches of A's.

Given these possibilities, do you think proceeding directly to mapping, without an additional trimming step, is a sound approach for our downstream analysis? Thank you again.
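
To put a number on how many reads are affected before deciding, one could count the reads that contain a long run of A's. The following is only a minimal sketch: the function name `polya_fraction` and the threshold of 10 are illustrative assumptions (the threshold mirrors the A{10} idea above), and it assumes standard 4-line FASTQ records, plain or gzipped.

```python
import gzip
import re
from itertools import islice


def polya_fraction(fastq_path, min_run=10):
    """Count reads containing at least `min_run` consecutive A's.

    Hypothetical helper for illustration; returns (reads_with_run, total_reads).
    Assumes unwrapped 4-line FASTQ records.
    """
    pattern = re.compile("A{%d,}" % min_run)
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    hits = 0
    total = 0
    with opener(fastq_path, "rt") as handle:
        # The sequence is the 2nd line of every 4-line record.
        for seq in islice(handle, 1, None, 4):
            total += 1
            if pattern.search(seq):
                hits += 1
    return hits, total
```

Running this on the trimmed R1 and R2 files of a single sample and comparing the two fractions would show whether the polyA signal involves a meaningful share of reads or only a small tail.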

FelixKrueger commented 9 months ago

My gut feeling is that you should be fine to proceed as-is, but for your own peace of mind I would potentially run a test (maybe just on a single sample?) in parallel. If you can convince yourself that the effects are either undetectable or negligibly small, you should be well prepared to answer any questions in that direction. In theory, Read 2 is the read where the methylation state is encoded by G/A (and not C/T as for Read 1), so if there is some sort of technical bias that makes it through to the uniquely mapped stage (which I doubt), you would expect some more unmethylated calls at these positions.
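
To illustrate the G/A point, here is a toy sketch of how a Read 2 base translates into a call. It is not how any aligner or methylation caller is implemented: `read2_calls` is a hypothetical helper, and it assumes a gap-free alignment in which the read and the reference window have the same length and orientation.

```python
def read2_calls(ref, read):
    """Toy methylation calls for a Read 2 aligned gap-free to `ref`.

    Hypothetical helper for illustration only. At a reference G, a G in the
    read is taken as methylated and an A as unmethylated; every other
    position carries no methylation information in Read 2.
    """
    if len(ref) != len(read):
        raise ValueError("ref and read must have the same length")
    calls = []
    for pos, (ref_base, read_base) in enumerate(zip(ref.upper(), read.upper())):
        if ref_base != "G":
            continue
        if read_base == "G":
            calls.append((pos, "methylated"))
        elif read_base == "A":
            calls.append((pos, "unmethylated"))
        # Any other base is a mismatch and yields no call.
    return calls


# A genuine read keeps some G's; an artifactual poly-A read shows only A's.
genuine = read2_calls("TGACGGAT", "TGACAGAT")
artifact = read2_calls("AGGAAGAG", "AAAAAAAA")
```

In this toy case every reference G covered by the poly-A read comes out as unmethylated, which is the direction of bias to look for when comparing a re-trimmed sample against the original one.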