andersen-lab / ivar

iVar is a computational package that contains functions broadly useful for viral amplicon-based sequencing.
https://andersen-lab.github.io/ivar/html/
GNU General Public License v3.0
115 stars 39 forks source link

iVAR inappropriately shifts position of reverse mapped reads for ONT data #175

Closed Alan-Collins closed 1 month ago

Alan-Collins commented 4 months ago

When processing ONT reads, iVar shifts the position of the read for reverse mapped reads. This results in incorrect alignment of the reads with the reference assembly. iVar seems to only change the position of reads that are clipped.

Steps to reproduce:

For the below example, ONT reads (SRR21789993) were mapped to a reference (NC_045512.2) using minimap2 version 2.27-r1193 and were then processed with samtools version 1.19.2 and iVar version 1.4.2.

  1. reads were mapped using minimap2 with minimap2 -ax map-ont wuhan.fa SRR21789993.fastq.gz
  2. output was sorted and the BAM processed with iVar with ivar trim -i SRR21789993.sorted.bam -m 30 -q 5 -e -b neb_vss2a.primer.bed -p ivar > ivar.log The bed file used is here.

The issue can be clearly seen in IGV. The blow screenshot shows a region of mapping with the original BAM in the top track and the BAM output by iVar in the bottom track.

image

After running iVar, reads now map to a region just 3' of the area with coverage in the top track. The reads mapping to this 3' region are all mapped in reverse. Further down, it can be seen that most (but not all) reverse mapped reads both extend further 3' than they should and do not align well with the reference,

image

Examining BAM entry for some of these mismapped reads indicates that they are now mapping further 3' in the reference and their cigar strings are modified. For example:

Original BAM

samtools view -N <(echo SRR21789993.141467) SRR21789993.sorted.bam| cut -f1,4,6 SRR21789993.141467 12646 53M1I140M1I217M1D112M1D1M1D3M1D14M57S

iVar output BAM

samtools view -N <(echo SRR21789993.141467) ivar.sorted.bam| cut -f1,4,6 SRR21789993.141467 12875 53M1I140M1I173M231S

This read should not have been moved as its CIGAR string begins with "M". However, it was moved to account for added soft clipping on the end of the CIGAR string.

Some reverse mapped reads were not shifted. These seem to be cases where iVar did not introduce any soft clipping. For example:

samtools view -N <(echo SRR21789993.118934) SRR21789993.sorted.bam| cut -f1,4,6 SRR21789993.118934 12633 66S19M7D134M1D32M1I7M2D3M1D93M1D164M1I5M2D73M samtools view -N <(echo SRR21789993.118934) ivar.sorted.bam | cut -f1,4,6 SRR21789993.118934 12633 66S19M7D134M1D32M1I7M2D3M1D93M1D164M1I5M2D73M

cmaceves commented 3 months ago

hey - thank you for providing all the info and files, we'll take a look!

jessebyoder commented 3 months ago

hi @cmaceves! do you have any timeline on when a fix for this issue might be released? Thank you!

cmaceves commented 3 months ago

hey @jessebyoder ! this is in progress now, but I'd estimate 2-3 weeks until a fix is pushed.

cmaceves commented 2 months ago

hey @Alan-Collins - I just pushed a fix for this to the branch issue_175. If you wouldn't mind testing it to confirm you're getting the desired output, that would be great!

Alan-Collins commented 2 months ago

Hi @cmaceves, I tested the branch issue_175 but still see the same issue. Did you not see the issue with that branch? Perhaps I have made a mistake in testing?

I built the version of ivar on the issue_175 branch by modifying the dockerfile on your repo's main branch to clone the branch you specified instead of master. i.e., line 29 of the dockerfile was changed to git clone -b issue_175 https://github.com/andersen-lab/ivar.git &&\. I then used the docker image to test ivar.

I'm still seeing the reverse reads shifted so that they don't line up with the reference correctly. You can see it in the attached IGV screenshot showing the ivar output in the top track and the input bam in the bottom track. I ran the same commands I specified in the original issue using the same files.

image

Alan-Collins commented 2 months ago

I wonder if the issue might have something to do with this reverse_cigar function? I haven't worked in C++ before to be able to work through the code to figure out exactly what it is doing and why iVAR would need to reverse the CIGAR string. However, I wonder if soft clipping the right end of the read (3' relative to the reference), which adds soft clips to the right end of the CIGAR is then being treated as if the read's left end (5' relative to the reference) had been shifted because the cigar was processed in reverse?

cmaceves commented 1 month ago

hey, apologies for the delay, I'm looking into this now! @Alan-Collins

cmaceves commented 1 month ago

we've pushed additional fixes to the same branch, if you could test it out that would be great! @Alan-Collins

Alan-Collins commented 1 month ago

Hi Chrissy,

Thanks! It looks like the shifting issue is fixed in the latest commit on that branch. However, it doesn't look like ivar is trimming the primer of the right end of reverse mapped reads. Would you prefer I opened a different issue for this behavior?

Some description of what I mean (using the same files that I described in the first comment on this issue):

Looking at the depth histogram for an amplicon before and after iVAR, you can see that the left of the amplicon, where a primer is found, has been removed, but the right side less so.

image

Looking at the read pileup, it looks like iVAR is trimming both primers from forward reads, but only the left primer from reverse reads. A small number of reads are remaining untrimmed in both directions.

image

gkarthik commented 1 month ago

Hey @Alan-Collins , on a quick glance, it seems like the reverse primers are trimmed from the right of the reverse reads according to the BED file. If you could you please point to specific coordinates or sets of reads where the primers are not trimmed, that would really help us debug this issue.

Alan-Collins commented 1 month ago

My apologies. I didn't look closely enough. There isn't a strand-based lack of trimming.

However, I am surprised to see so many reads that still map to primer binding regions. Specifically, after running iVAR, most of the remaining reads still map to a primer region

$ samtools view -c ivar.sorted.bam 71379 $ samtools view --region-file neb_vss2a.primer.bed -c ivar.sorted.bam 62351

Is this the expected behavior or shouldn't iVAR be trimming all reads to remove primer sequences?

An example: After running iVAR, 3 reads map to the second primer in the file linked in my original post. The primer entry in the BED file is NC_045512.2 418 443 varskip-0317-1_02_LEFT 1 +

The first 4 columns of the SAM entries for those reads are

SRR21789993.85486 0 NC_045512.2 418 SRR21789993.134924 0 NC_045512.2 418 SRR21789993.156347 0 NC_045512.2 418

i.e., all three reads map to the first position of the primer onwards and contain the whole forward primer.

gkarthik commented 1 month ago

Oh no worries! Thank you for going over this with us.

In the example with pos 418 you pointed out, the primer actually starts at position 419 (BED file format is 0-based and samtools view is 1-based; please see figure(s) below). For illumina data, we typically use an exact match to trim those reads that start within the primer region. We did however, build a fuzzier version of this approach where we allow for the read to start a few bases before the start of the primer. Please see the example below with -x 5 wherein we allow reads to start 5 bases before the primer position for it to be trimmed. Maybe this option could be helpful in this case?

Exact match,

Trimmed primers from 80.26% (140687) of reads.
77.5% (135850) of reads were quality trimmed below the minimum length of 304 bp and were not written to file.
2.79% (4892) of reads started outside of primer regions. Since the -e flag was given, these reads were written to file.
100% (175284) of reads had their insert size smaller than their read length

image

Fuzzy match with -x 5,

Trimmed primers from 94.87% (166288) of reads.
78.04% (136800) of reads were quality trimmed below the minimum length of 304 bp and were not written to file.
0.22% (377) of reads started outside of primer regions. Since the -e flag was given, these reads were written to file.
100% (175284) of reads had their insert size smaller than their read length

image

Alan-Collins commented 1 month ago

Great! Thank you for explaining and for the work to resolve this issue!