RobertsLab / resources

https://robertslab.github.io/resources/

Interpreting MultiQC report for trimmed data #1065

Closed: yaaminiv closed this issue 3 years ago

yaaminiv commented 3 years ago

Question: Are issues with per sequence GC content, overrepresented sequences, sequence length distributions, and per tile sequence quality scores reasons to re-trim data/not progress with bismark alignment?

It's been a while since I've looked at MultiQC reports for trimmed data, so I want to make sure that I didn't miss anything when reviewing the MultiQC report for the trimmed Hawaii data to look at sample quality post-trimming.

The per sequence GC content failed for all samples. Most samples had slight abnormalities or very unusual overrepresented sequences. All samples had slightly abnormal sequence length distributions, and most had slightly abnormal per tile sequence quality scores.

[Screenshots: four MultiQC report plots]

In looking at some of the FastQC reports from Sam, I noticed that the per sequence GC content remained the same before and after trimming for several samples. Many of the files with overrepresented sequences were the R2 (second in pair) files, and the overrepresented sequence had no hit.

[Screenshot: FastQC report excerpt]

Per sequence GC content, sequence length distribution, and overrepresented sequences were potential problem areas with the coral MethCompare data as well. I also looked at the MultiQC report for my C. virginica gonad methylation data. Most samples had slightly unusual per sequence GC content and per base sequence content, and all samples had slightly unusual sequence length distributions. So is this an issue to worry about?

yaaminiv commented 3 years ago

Based on science hour discussion:

  1. I need to re-trim my files. There are still some adapter sequences in the files (based on overrepresented sequences in the R1 files), which makes sense given how TrimGalore works: most TrimGalore jobs require two rounds of trimming.
  2. While waiting for additional information from Zymo, I can do a third round of trimming to remove the poly-G tail (--adapter GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG and --adapter2 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG); a sketch of that command is below.
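For reference, a minimal sketch of what that extra TrimGalore pass could look like; the file names, output directory, and core count are placeholders rather than the actual job:

```bash
# Hypothetical third trimming pass: strip poly-G tails from already-trimmed paired reads.
# --adapter/--adapter2 take the 50 bp poly-G sequences mentioned above.
trim_galore --paired \
  --adapter GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG \
  --adapter2 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG \
  --cores 4 \
  --output_dir polyG-trimmed/ \
  sample_R1_val_1.fq.gz sample_R2_val_2.fq.gz
```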
hputnam commented 3 years ago

I am working on this with Zymo now. They have asked me to upload raw data so they can try to troubleshoot. Here is my conversation with Illumina about this issue: "Also, I have a suspicion about the cause of the poor/unexpected read 2 performance: in the past, I have seen similar library preparation approaches omit the first base of the adapter (the base that touches the insert sequence) on the P7 end. When the read 2 primer binds, the last base does not base-pair, so we get poor synthesis from the read 2 primer. I wanted to check the primer sequences from Zymo to see if this base is missing. This base is commonly missing when libraries are prepared with TruSeq-style adapter sequences added via PCR, which is what we see in the workflow. Adding in the extra base during library prep completely rescues the read 2 performance. I can't say for certain if this is the issue without seeing the sequences, though I think it is pretty likely."

laurahspencer commented 3 years ago

I'm interested to see what y'all find out about what might have caused the poly-G tails in your data. I also have poly-G tails in my QuantSeq (aka TagSeq) libraries. Various threads online indicate that poly-G tails may be due to diminishing signal during sequencing, which is read as a "G" on two-color sequencing systems (like the NovaSeq, which is what my libraries were sequenced on). Based on my online research, poly-G tails don't seem to be a big issue and can simply be trimmed (one way to do that is sketched below).
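For anyone who wants to trim these directly, here is a minimal sketch using fastp, which has a built-in poly-G trimming option; the file names are placeholders, and this is just one possible approach rather than what was actually run on these libraries:

```bash
# Hypothetical fastp run with poly-G trimming explicitly enabled.
# fastp turns on poly-G trimming automatically for NextSeq/NovaSeq data,
# but --trim_poly_g forces it on regardless of the detected platform.
fastp \
  --in1 sample_R1.fq.gz --in2 sample_R2.fq.gz \
  --out1 sample_R1.trim.fq.gz --out2 sample_R2.trim.fq.gz \
  --trim_poly_g \
  --html sample_fastp.html --json sample_fastp.json
```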

hputnam commented 3 years ago

@laurahspencer yes, I think they can be trimmed without issue. The biggest issue for us is that we are losing a substantial fraction of our reads, which means both data loss and $$$$ loss.

Zymo is looking at our fastq data now. @yaaminiv can you please send me your adapter/barcode sequences? I think Zymo designed them for my lab and for you as well.

yaaminiv commented 3 years ago

The biggest issue for us is that we are losing a substantial fraction of our reads, which means both data loss and $$$$ loss.

@hputnam How much of your data are you losing? I did some rough calculations here, and manually trimming the poly-G tail leads to 1-3% read loss in my samples (a sketch of that kind of calculation is below).
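For context, a rough sketch of how that kind of read-loss percentage can be estimated by counting records before and after the extra trimming step (file names are placeholders):

```bash
# Gzipped FASTQ files have 4 lines per read, so read counts scale with line counts.
before=$(zcat sample_R1_trimmed.fq.gz | wc -l)
after=$(zcat sample_R1_polyG_trimmed.fq.gz | wc -l)
# Report the percentage of reads lost after poly-G trimming.
echo "$before $after" | awk '{printf "%.2f%% of reads lost\n", 100 * ($1 - $2) / $1}'
```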

can you please send me your adapter/barcode sequences?

Here is a file with adapter sequences trimmed out of my data (obtained from fastqc trimming reports): https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/Haws_01-trimgalore/adapter-sequences.txt

kubu4 commented 3 years ago

You should also consider the number of bases lost from each read. The poly-G is present after adapter removal, which indicates it's part of the "sequence" data you'd want to keep. So, on top of whole reads, you're losing the equivalent number of bases from every read that carries a poly-G tail (a quick way to quantify this is sketched below).
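To put a number on that, here is a quick sketch of one way to compare mean read lengths before and after poly-G trimming (file names are placeholders):

```bash
# Sequence lines are the 2nd line of every 4-line FASTQ record.
zcat sample_R1_trimmed.fq.gz | \
  awk 'NR % 4 == 2 {sum += length($0); n++} END {printf "mean length before: %.1f bp\n", sum/n}'
zcat sample_R1_polyG_trimmed.fq.gz | \
  awk 'NR % 4 == 2 {sum += length($0); n++} END {printf "mean length after: %.1f bp\n", sum/n}'
```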

kubu4 commented 3 years ago

Response from Zymo:

Hi, Sam.

We were looking into your question. Was it a small percentage of the reads? We checked some FastQC stats over here and observed something similar. I wonder if it has to do with the 2-channel chemistry used in the newer Illumina NovaSeq. We found this blog post describing a very similar issue, https://www.dna-ghost.com/single-post/2018/01/23/Be-careful-the-poly-G-sequence-from-NextSeq-run

That would explain it. And if only a smaller number of reads are affected, then trimming or filtering them out should be good enough to resolve any issues.

Best Regards,

-Keith

sr320 commented 3 years ago

resolved?

hputnam commented 3 years ago

Not on my end. While the NovaSeq and NextSeq systems with two-color chemistry show this issue, they are not ultimately responsible. Zymo is looking at the data now.