Closed yaaminiv closed 3 years ago
Based on science hour discussion:
trimgalore works. Most trimgalore jobs require two rounds of trimming (--adapter GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG and --adapter2 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG) while waiting for additional information from Zymo.
I am working on this with Zymo now. They have asked me to upload raw data so they can try to troubleshoot. Here is my conversation with Illumina about this issue: "Also, I have a suspicion about the cause of the poor/unexpected read 2 performance: in the past, I have seen similar library preparation approaches omit the first base of the adapter (the base that touches the insert sequence) on the p7 end. When the read 2 primer binds, the last base does not basepair, so we get poor synthesis from the read 2 primer. I wanted to check the primer sequences from Zymo to see if this base is missing. This base is commonly missing when libraries are prepared with TruSeq-style adapter sequences added via PCR, which is what we see in the workflow. Adding in the extra base during library prep completely rescues the read 2 performance. I can’t say for certain if this is the issue without seeing the sequences, though I think it is pretty likely."
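Going back to the poly-G trimming round in the opening post, here is a minimal sketch of how that second round might be assembled. It only builds and prints the command and does not run it. The input file names and output directory are hypothetical, and the 50-base poly-G length is an assumption that should match whatever string is actually passed on the command line.

```python
import shlex

# Assumed poly-G "adapter": 50 G bases (adjust to the string actually used).
POLY_G = "G" * 50


def build_trim_command(read1, read2, out_dir):
    """Return the argument list for a paired-end poly-G trimming round."""
    return [
        "trim_galore",
        "--paired",
        "--adapter", POLY_G,    # poly-G treated as the read 1 adapter
        "--adapter2", POLY_G,   # poly-G treated as the read 2 adapter
        "--output_dir", out_dir,
        read1,
        read2,
    ]


# Hypothetical inputs: the paired output files of the first trimming round.
cmd = build_trim_command(
    "sample_R1_val_1.fq.gz", "sample_R2_val_2.fq.gz", "polyG-trimmed"
)
print(shlex.join(cmd))
```

Building the argument list first makes it easy to inspect the exact command before handing it to something like subprocess.run.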
I'm interested to see what y'all find out about what might have caused the poly-G tails in your data. I also have poly-G tails in my QuantSeq (aka TagSeq) libraries. Various threads online indicate that poly-G tails may be due to diminishing signal during sequencing, which is read as a "G" in two-color sequencing systems (like the NovaSeq, which sequenced my libraries). Based on my online research, poly-G tails do not seem to be a big issue and can simply be trimmed.
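To make the idea of a poly-G tail concrete, here is a rough sketch of how one might measure and remove a run of G bases at the 3' end of a read. The function names and the 10-base minimum tail length are assumptions chosen for illustration, not settings used by anyone in this thread.

```python
def poly_g_tail_length(seq):
    """Number of consecutive G bases at the 3' end of a read."""
    return len(seq) - len(seq.rstrip("G"))


def trim_poly_g(seq, min_tail=10):
    """Remove the 3' G run if it is at least min_tail bases long.

    min_tail is an assumed threshold; shorter G runs are left alone
    because they may be genuine sequence.
    """
    tail = poly_g_tail_length(seq)
    return seq[: len(seq) - tail] if tail >= min_tail else seq


read = "ACGTTAGCCA" + "G" * 20
print(poly_g_tail_length(read))  # 20
print(trim_poly_g(read))         # ACGTTAGCCA
```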
@laurahspencer yes, I think they can be trimmed without issue. The biggest issue for us is we are losing a substantial fraction of our reads, thus data loss and $$$$ loss.
Zymo is looking at our fastq data now. @yaaminiv can you please send me your adapter/barcode sequences. I think Zymo designed them for my lab and for you as well.
> The biggest issue for us is we are losing a substantial fraction of our reads, thus data loss and $$$$ loss.
@hputnam How much of your data are you losing? I did some rough calculations here and manually trimming the poly-G tail leads to 1-3% read loss in my samples.
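For anyone wanting to repeat that kind of rough calculation, read loss is just the fraction of reads that do not survive trimming. A minimal sketch, with made-up counts used purely for illustration:

```python
def percent_read_loss(reads_before, reads_after):
    """Percentage of reads removed between two processing steps."""
    if reads_before <= 0:
        raise ValueError("reads_before must be positive")
    return 100.0 * (reads_before - reads_after) / reads_before


# Made-up counts, not taken from any sample in this thread.
print(f"{percent_read_loss(1_000_000, 980_000):.1f}%")  # 2.0%
```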
> can you please send me your adapter/barcode sequences.
Here is a file with adapter sequences trimmed out of my data (obtained from fastqc trimming reports): https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/Haws_01-trimgalore/adapter-sequences.txt
You should also consider the number of bases lost from each read. The poly-G is present after adapter removal, which indicates it's part of the "sequence" data that you'd want. So you're losing the equivalent number of bases from each read with that poly-G sequence.
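Base-level loss can be estimated separately from read-level loss. The sketch below sums the 3' G runs across a set of reads and compares that with the total number of bases. The example reads are invented, and the count is an upper bound because any genuine G bases at the end of a read are counted too.

```python
def bases_lost_to_poly_g(reads):
    """Return (bases_in_3prime_G_runs, total_bases) for a list of reads."""
    lost = sum(len(r) - len(r.rstrip("G")) for r in reads)
    total = sum(len(r) for r in reads)
    return lost, total


# Invented reads for illustration only.
reads = [
    "ACGTACGTAC" + "G" * 15,  # 15-base tail
    "TTGACCATGA",             # no tail
    "CCATA" + "G" * 30,       # 30-base tail
]
lost, total = bases_lost_to_poly_g(reads)
print(f"{lost} of {total} bases ({100 * lost / total:.1f}%) are in 3' G runs")
```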
Response from Zymo:
Hi, Sam.
We were looking into your question. Was it a small percentage of the reads? We checked some FastQC stats over here and observed something similar. I wonder if it has to do with the 2-channel chemistry used in the newer Illumina NovaSeq. We found this blog post describing a very similar issue, https://www.dna-ghost.com/single-post/2018/01/23/Be-careful-the-poly-G-sequence-from-NextSeq-run
That would explain it. And if it is only a smaller number of reads that are affected, then trimming or filtering out should be good enough to resolve any issues.
Best Regards,
-Keith
resolved?
Not on my end. While the NovaSeq and NextSeq systems with the 2 color chemistry show this issue, they are not ultimately responsible. Zymo is looking at the data now.
Question: Are issues with per sequence GC content, overrepresented sequences, sequence length distributions, and per tile sequence quality scores reasons to re-trim data/not progress with bismark alignment?
It's been a while since I've looked at MultiQC reports for trimmed data, so I want to make sure that I didn't miss anything when reviewing the MultiQC report for the trimmed Hawaii data to look at sample quality post-trimming.
The per sequence GC content failed for all samples. Most samples had slight abnormalities or very unusual overrepresented sequences. All samples had slightly abnormal sequence length distributions, and most had slightly abnormal per tile sequence quality scores.
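One way to put numbers on a summary like this is to tally module statuses across samples. The sketch below assumes the tab-separated layout of a FastQC summary.txt file (status, module name, file name, one module per line). The sample names and statuses are invented for illustration.

```python
from collections import Counter, defaultdict


def tally_fastqc(summary_texts):
    """Count PASS/WARN/FAIL per module across several summary.txt contents."""
    tally = defaultdict(Counter)
    for text in summary_texts:
        for line in text.splitlines():
            if not line.strip():
                continue  # skip blank lines
            status, module, _filename = line.split("\t", 2)
            tally[module][status] += 1
    return tally


# Invented summaries for two hypothetical files.
example = [
    "FAIL\tPer sequence GC content\ts1_R1.fq.gz\n"
    "WARN\tSequence Length Distribution\ts1_R1.fq.gz\n",
    "FAIL\tPer sequence GC content\ts1_R2.fq.gz\n"
    "PASS\tSequence Length Distribution\ts1_R2.fq.gz\n",
]
tally = tally_fastqc(example)
for module, counts in tally.items():
    print(module, dict(counts))
```

In practice the strings would come from reading each sample's summary.txt, and the per-module counts make it easier to see whether a flag is universal (as with GC content here) or limited to a few files.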
In looking at some of the FastQC reports from Sam, I noticed that the per sequence GC content remained the same before and after trimming for several samples. Many of the samples with overrepresented sequences were the 2nd paired files, and had a sequence with no hit.
Per sequence GC content, sequence length distribution, and overrepresented sequences were potential problem areas with the coral MethCompare data as well. I also looked at my C. virginica gonad methylation data MultiQC report. Most samples had slightly unusual per sequence GC content and per base sequence content, and all samples had slightly unusual sequence length distribution. So is this an issue to worry about?