cfe-lab / MiCall

Pipeline for processing FASTQ data from an Illumina MiSeq to genotype human RNA viruses like HIV and hepatitis C
https://cfe-lab.github.io/MiCall
GNU Affero General Public License v3.0

G2P zero-length errors caused by soft clipping #321

Closed by donkirkby 8 years ago

donkirkby commented 8 years ago

Several samples in the 2 May 2016 run had over 1000 zero-length errors in the g2p.csv file. I looked at sample 71244A-HLA-B-PAAL0412-5-2-V3LOOP_S54 and found two common sequences that matched only the ends of the two successful common sequences. Those ends lie beyond the V3LOOP region, so the sequences were zero-length after clipping to V3LOOP.
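To illustrate why a read that lies entirely beyond the region ends up with zero length, here is a minimal sketch of clipping an aligned read to a region. This is not MiCall's actual code: the function name `clip_to_region`, the coordinates, and the assumption that the read aligns without indels are all made up for the example.

```python
def clip_to_region(read_start, read_seq, region_start, region_end):
    """Return the part of an aligned read inside [region_start, region_end).

    Hypothetical helper: positions are 0-based reference coordinates, and
    the read is assumed to align without insertions or deletions.
    """
    read_end = read_start + len(read_seq)
    overlap_start = max(read_start, region_start)
    overlap_end = min(read_end, region_end)
    if overlap_start >= overlap_end:
        # The read does not touch the region, so nothing is left.
        return ""
    return read_seq[overlap_start - read_start:overlap_end - read_start]


# A read that spans the region keeps the overlapping bases.
print(clip_to_region(10, "ACGTACGT", 12, 16))  # prints GTAC

# A read that maps entirely downstream of the region clips to nothing.
print(repr(clip_to_region(20, "ACGT", 12, 16)))  # prints ''
```

The second call is the situation described above: the sequence is real, but none of it falls inside the region, so the clipped result is empty.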

I went back to the mapping step and looked at one read that only mapped at the end: M01841:228:000000000-AM7C0:1:1101:7615:4936. The forward and reverse mates mapped to the exact same position, with a lot of soft clipping. It also looks like the forward and reverse reads might be swapped.

For comparison, this read mapped normally: M01841:228:000000000-AM7C0:1:1101:16462:1751

Why are these reads getting soft clipped? Does the clipped portion have something strange in it, or are the forward and reverse portions just swapped?
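One way to quantify "a lot of soft clipping" is to total the `S` operations in each read's CIGAR string from the SAM file. The sketch below does that with the standard library only. The helper name `soft_clip_summary` and the example CIGAR strings are hypothetical; they are not taken from the reads in this issue.

```python
import re

# One CIGAR operation: a count followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that consume bases of the read (query) sequence.
QUERY_OPS = set("MIS=X")


def soft_clip_summary(cigar):
    """Return (soft_clipped_bases, read_length) for a CIGAR string."""
    clipped = 0
    read_length = 0
    for count, op in CIGAR_OP.findall(cigar):
        count = int(count)
        if op == "S":
            clipped += count
        if op in QUERY_OPS:
            read_length += count
    return clipped, read_length


# Hypothetical examples: a heavily clipped read and a normally mapped one.
print(soft_clip_summary("200S51M"))  # prints (200, 251)
print(soft_clip_summary("251M"))     # prints (0, 251)
```

Comparing the clipped count to the read length for both mates makes it easier to spot pairs where only a short stretch at one end actually aligned.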

donkirkby commented 8 years ago

I looked at the reads and found very strange results. The upstream end of the forward read (8 codons) and the downstream end of the reverse read (49 codons) match the read that mapped normally. The middle 45 codons seem to be a mangled version of a downstream part of the envelope region.

donkirkby commented 8 years ago

It looks like this sample had an unusual sequence that mapped to the forward primer in two places. Reads that mapped in the regular place appeared as normal reads that covered the whole V3LOOP region. Reads that mapped the forward primer downstream of V3LOOP appeared as a small portion of envelope, and didn't cover the V3LOOP region at all. We found a portion of the reads that matches the TAATAGTACA at the end of the V3 primers.
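A quick way to check whether the primer tail can land in more than one place is to list every position where it occurs in a sequence. The sketch below uses exact string matching, which is a simplification: real primers can bind with mismatches. The helper name `find_all` and the example sequence are made up; only the TAATAGTACA motif comes from this issue.

```python
def find_all(sequence, motif):
    """Return the 0-based start of every occurrence of motif, including overlaps."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


PRIMER_TAIL = "TAATAGTACA"  # end of the V3 primers, as quoted above

# Made-up sequence with the primer tail in two places.
example = "GG" + PRIMER_TAIL + "ACGTACGT" + PRIMER_TAIL + "CC"
print(find_all(example, PRIMER_TAIL))  # prints [2, 20]
```

More than one hit in a sample's consensus or in a read would be consistent with the forward primer binding both at its regular site and downstream of V3LOOP.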

This sample is so unusual that only four reads mapped to the HIV1B-env seed in preliminary mapping. Those reads changed the consensus enough that 10,000 reads mapped in later rounds. Those four reads were probably contamination from other samples.

Now that we understand what happened, I'm closing the issue.