High PolyA content in R2 reads after trimming.

AmrSaadeldin commented 9 months ago

Hello

I am working with paired-end human whole-genome bisulfite sequencing data. After using Trim Galore to remove adapters, I encountered an issue with the FastQC report for R2 reads across all samples, which indicates a high PolyA content. This is unexpected since Trim Galore usually removes PolyA sequences. My primary concern is whether to proceed with mapping, given that the R1 reads are fine and have passed all FastQC tests. Note also that neither R1 nor R2 reads show any overrepresented sequences in any sample, and they meet most other FastQC criteria, except for the adapter content in R2 reads. I am seeking advice on how to address the PolyA content in R2 reads and whether it is advisable to move forward with the current data.

Below are the images before and after trimming.

The first image: The diagram shows the adapter content before trimming across various samples. Each line in the diagram represents the adapter content for a specific sample. The blue lines indicate the Illumina adapters in the R1 and R2 samples, while the orange line represents the polyA content in one of the R1 samples. The remaining lines, colored red and light blue, correspond to the polyA content in all R2 samples.

The second image: This is the FastQC report showing adapter content after trimming with Trim Galore. Every line in this report represents polyA content. The orange-red line at the bottom shows the polyA content for one of the R1 samples. All other lines correspond to the polyA content in the R2 samples.

FelixKrueger commented 9 months ago

Hi @AmrSaadeldin ,

Thanks for the details. Here is my initial assessment of the situation:

- In its default mode, Trim Galore looks for adapter contamination, which has obviously worked as expected.
- It does not perform PolyA removal as a matter of course.
- The PolyA signal appears right at the very start of the sequences and continues to increase in a linear fashion along the read. There are at least two possible scenarios:
1. You really do see reads that are complete repetitions of A from start to end (which would probably be some kind of technical artefact?): these reads will almost certainly not align uniquely in the genome, so would effectively get filtered out during the alignment step
2. I am not exactly sure how long the poly-A sequence that FastQC searches for is, but assuming your read length was 150 bp and looking at the plot, I would assume 10-12 bp. There are a number of positions in the genome that contain a stretch of 10-12 As in a row, and if these were enriched for some reason you would see the PolyA value creep up (which might not really be poly-A contamination in this case). If these genomic regions were affected, there is a chance that the 150 bp sequences would map just fine.
  - For peace of mind you could run a second round of trimming with a poly-A adapter, using the single-base notation: a single base may also be given as e.g. -a A{10}, to be expanded to -a AAAAAAAAAA
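
To make the A{10} notation concrete, here is a minimal Python sketch of what such a trimming pass amounts to. It is an illustration only, not Trim Galore's or Cutadapt's actual implementation: the function names `expand_adapter` and `trim_at_adapter` are hypothetical, and the matching is exact, whereas a real adapter trimmer also tolerates mismatches and partial matches at the 3' end.

```python
import re


def expand_adapter(spec: str) -> str:
    """Expand a single-base spec such as 'A{10}' into 'AAAAAAAAAA'.

    Anything that is not of the form BASE{N} is returned unchanged.
    """
    match = re.fullmatch(r"([ACGTN])\{(\d+)\}", spec)
    if match is None:
        return spec  # already a literal adapter sequence
    base, count = match.groups()
    return base * int(count)


def trim_at_adapter(read: str, adapter: str) -> str:
    """Cut the read at the first exact occurrence of the adapter.

    Simplification (assumption): exact matches only, no error tolerance.
    """
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


if __name__ == "__main__":
    adapter = expand_adapter("A{10}")
    print(adapter)
    print(trim_at_adapter("ACGT" + "A" * 12 + "CG", adapter))
```

Note that a read consisting only of As is trimmed to length zero under this scheme, so it would drop out before alignment.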

I don't think the results would be very different in either case.
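
If you want to tell the two scenarios apart on a subset of R2 reads, a rough sketch like the following could help. It assumes the reads are already loaded as plain strings and uses a run length of 10 As as the probe (an assumption; the length FastQC actually uses may differ). The function names are hypothetical. The profile is cumulative, in the spirit of the FastQC adapter plot: each position gives the fraction of reads in which a run of As has started at or before that position.

```python
def cumulative_polya_fraction(reads, run_len=10):
    """Per-position fraction of reads whose first poly-A run of
    length run_len starts at or before that position."""
    reads = list(reads)
    if not reads:
        return []
    probe = "A" * run_len
    starts = [0] * max(len(r) for r in reads)
    for read in reads:
        pos = read.find(probe)
        if pos != -1:
            starts[pos] += 1
    profile, running = [], 0
    for count in starts:
        running += count
        profile.append(running / len(reads))
    return profile


def all_a_fraction(reads):
    """Fraction of reads that consist of nothing but A (scenario 1)."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(1 for r in reads if r and set(r) == {"A"}) / len(reads)


if __name__ == "__main__":
    demo = [
        "A" * 20,                       # all-A read (scenario 1)
        "CGCG" + "A" * 10 + "CGCGCG",   # internal A stretch (scenario 2)
        "CGTACGTACGTACGTACGTA",         # no long A run
        "C" * 20,
    ]
    print(cumulative_polya_fraction(demo))
    print(all_a_fraction(demo))
```

A large value already at position 0 together with a high all-A fraction would point towards scenario 1, whereas a profile that starts near zero and rises steadily would fit scenario 2 better.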

AmrSaadeldin commented 9 months ago

Hi @FelixKrueger, thank you so much for your help and your detailed observations.

Based on your insights, I am now contemplating whether to conduct another round of trimming using the A{10} parameter. However, I'm concerned that this might introduce bias in the data. Considering this, my inclination is towards proceeding directly with the mapping phase. I suspect that the sequences might either not map uniquely or not map at all, which, as you mentioned, could be due to technical artifacts or genomic stretches of A's.

Given these possibilities, do you think proceeding directly to mapping, without an additional trimming step, is a sound approach for our downstream analysis? Thank you again.

FelixKrueger commented 9 months ago

My gut feeling is that you should be fine to proceed as-is, but for your own peace of mind I would potentially run a test (maybe just on a single sample?) in parallel. If you can convince yourself that the effects are either undetectable or negligibly small, you should be well prepared to answer any questions in that direction. In theory, Read 2 is the read where the methylation state is encoded by G/A (and not C/T as for Read 1), so if there is some sort of technical bias that makes it through to the uniquely mapped stage (which I doubt), you would expect some more unmethylated calls at these positions.
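
To illustrate why spurious As in Read 2 would show up as unmethylated calls, here is a toy Python sketch of the G/A encoding. It assumes a directional library and that the reference sequence is given in the same orientation as the read; the function names are hypothetical and this is not how a methylation caller is actually implemented.

```python
from collections import Counter


def call_read2_base(ref_base: str, read_base: str) -> str:
    """Classify one aligned Read 2 position under G/A encoding.

    Assumption: ref_base is in the read's orientation, so a reference G
    marks a cytosine on the opposite strand.
    """
    if ref_base != "G":
        return "uninformative"
    if read_base == "G":
        return "methylated"
    if read_base == "A":
        return "unmethylated"
    return "mismatch"


def count_read2_calls(ref_seq: str, read_seq: str) -> Counter:
    """Tally calls over an ungapped alignment of equal-length strings."""
    return Counter(call_read2_base(r, b) for r, b in zip(ref_seq, read_seq))


if __name__ == "__main__":
    ref = "GATGCG"
    print(count_read2_calls(ref, "GATGCG"))  # read matches the reference
    print(count_read2_calls(ref, "AAAAAA"))  # artefactual poly-A read
```

In this toy case the poly-A read turns every reference G into an unmethylated call, which is the direction of bias to look for when comparing the trimmed and untrimmed test runs.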

FelixKrueger / TrimGalore

High PolyA content in R2 reads after trimming. #180