andersen-lab / ivar

iVar is a computational package that contains functions broadly useful for viral amplicon-based sequencing.
https://andersen-lab.github.io/ivar/html/
GNU General Public License v3.0
118 stars 40 forks source link

Filter based on insert size #73

Closed gkarthik closed 4 years ago

gkarthik commented 4 years ago

Problem: iVar by default trims forward primers form + strand and reverse from - strand. However, if the insert size is smaller than the read length, it should also trim reverse primers from the + strand and vice versa. This should not affect the consensus since the ~90% of insert sizes >= 200. However, it might cause issues if the amplicon size is very small or if there are issues during library perp and during downstream variant calling.

More details in #72 raised by @jts.

                              2_LEFT---->
cDNA       -----------------------------------------------------------------
                                                           <----- 2_RIGHT

Amplicon                      2_LEFT------------------------------2_RIGHT

Fragment                      2_LEFT-------------------//-----------2_RIGHT

Sequence                      //-------------------------2-RIGHT (+)
                              //-------------------------2-RIGHT (-)

Potential fix:

  1. Implement an insert size based filter in the trimming step.
  2. If a read starts before the overlapping regions of a pair amplicons, determining its original amplicon will be straight forward. In this case, we can just trim off the primers from the opposite strand.
  3. If a read starts within the overlap, we should just exclude these reads as determining its original amplicon is not straight forward.
m-bull commented 4 years ago

This would be a great feature - adding that we occaisionally see impossibly large insert sizes (> a single amplicon), perhaps from chimeras, so being able to set minimum and maximum insert sizes would be nice.

jts commented 4 years ago

Adding to what @m-bull said, we also see reads that have what should be impossible alignments, like starting in sequence that is unique to an amplicon from pool1, and ending in sequence unique to the adjacent amplicon in pool2. It would be great if these could be filtered as well as they are likely aberrant amplicons.

jts commented 4 years ago

Hi all,

Could I get an update on this issue?

Jared

AlaaALatif commented 4 years ago

Hi Jared,

I haven't had time to start on this issue yet, but planning to do so as soon as I can. Definitely by this week. Will keep you updated.

Cheers, Al

jts commented 4 years ago

Hi all,

I've looked into this issue a little more and have developed a set of test reads to investigate the trimming behaviour for certain arrangements of pairs. Here are the reads I am using:

>read1_full_amplicon_readthrough/1
AGAGTTTCTTAGAGACGGTTGGGAAATTGTTAAATTTATCTCAACCTGTGCTTGTGAAATTGTCGGTGGACAAATTGTCACCTGTGCAAAGGAAATTAAGGAGAGTGTTCAGACATTCTTTAAGCTTGTAAATAAATTTTTGGCTTTGTGTGCTGACTCTATCATTATTGGTGGAGCTAAACTTAAAGCCTTGAATTTAGGTGAAACATTTGTCACGCACTCAAAGGGATTGTACAGAAAGTGTGTTAAATCCAGAGAAGAAACTGGCCTACTCATGCCTCTAAAAGCCCCAAAAGAAATTATCTTCTTAGAGGGAGAAACACTTCCCACAGAAGTGTTAACAGAGGAAGTTGTCTTGAAAACTGGTGATTTACAACCATTAGAACAACCTACTAGTGAAGCTGTTGAAGC
>read1_full_amplicon_readthrough/2
GCTTCAACAGCTTCACTAGTAGGTTGTTCTAATGGTTGTAAATCACCAGTTTTCAAGACAACTTCCTCTGTTAACACTTCTGTGGGAAGTGTTTCTCCCTCTAAGAAGATAATTTCTTTTGGGGCTTTTAGAGGCATGAGTAGGCCAGTTTCTTCTCTGGATTTAACACACTTTCTGTACAATCCCTTTGAGTGCGTGACAAATGTTTCACCTAAATTCAAGGCTTTAAGTTTAGCTCCACCAATAATGATAGAGTCAGCACACAAAGCCAAAAATTTATTTACAAGCTTAAAGAATGTCTGAACACTCTCCTTAATTTCCTTTGCACAGGTGACAATTTGTCCACCGACAATTTCACAAGCACAGGTTGAGATAAATTTAACAATTTCCCAACCGTCTCTAAGAAACTCT
>read2_full_amplicon_disjoint/1
AGAGTTTCTTAGAGACGGTTGGGAAATTGTTAAATTTATCTCAACCTGTGCTTGTGAAATTGTCGGTGGACAAATTGTCACCTGTGCAAAGGAAATTAAGGAGAGTGTTCAGACATTCTTTAAGCTTGTAAATAAATTTTTGGCTTTGTGTGCTGACTCTATCATTATTGGTGGAGCTA
>read2_full_amplicon_disjoint/2
GCTTCAACAGCTTCACTAGTAGGTTGTTCTAATGGTTGTAAATCACCAGTTTTCAAGACAACTTCCTCTGTTAACACTTCTGTGGGAAGTGTTTCTCCCTCTAAGAAGATAATTTCTTTTGGGGCTTTTAGAGGCATGAGTAGGCCAGTTTCTTCTCTGGATTTAACACACTTTCTGTAC
>read3_left_readthrough/1
AGAGTTTCTTAGAGACGGTTGGGAAATTGTTAAATTTATCTCAACCTGTGCTTGTGAAATTGTCGGTGGACAAATTGTCACCTGTGCAAAGGAAATTAAGGAGAGTGTTCAGACATTCTTTAAGCTTGTAAATAAATTTTTGGCTTTGTGTGCTGACTCTATCATTATTGGTGGAGCTA
>read3_left_readthrough/2
TAGCTCCACCAATAATGATAGAGTCAGCACACAAAGCCAAAAATTTATTTACAAGCTTAAAGAATGTCTGAACACTCTCCTTAATTTCCTTTGCACAGGTGACAATTTGTCCACCGACAATTTCACAAGCACAGGTTGAGATAAATTTAACAATTTCCCAACCGTCTCTAAGAAACTCT
>read4_right_readthrough/1
GCTTCAACAGCTTCACTAGTAGGTTGTTCTAATGGTTGTAAATCACCAGTTTTCAAGACAACTTCCTCTGTTAACACTTCTGTGGGAAGTGTTTCTCCCTCTAAGAAGATAATTTCTTTTGGGGCTTTTAGAGGCATGAGTAGGCCAGTTTCTTCTCTGGATTTAACACACTTTCTGTAC
>read4_right_readthrough/2
GTACAGAAAGTGTGTTAAATCCAGAGAAGAAACTGGCCTACTCATGCCTCTAAAAGCCCCAAAAGAAATTATCTTCTTAGAGGGAGAAACACTTCCCACAGAAGTGTTAACAGAGGAAGTTGTCTTGAAAACTGGTGATTTACAACCATTAGAACAACCTACTAGTGAAGCTGTTGAAGC
>read5_left_ambiguous/1
AGAGTTTCTTAGAGACGGTTGGGAAATTGTTAAATTTATCTCAACCTGTGCTTGTGAAATTGTCGGTGGACAAATTGTCACCTGTGCA
>read5_left_ambiguous/2
GCTTCAACAGCTTCACTAGTAGGTTGTTCTAATGGTTGTAAATCACCAGTTTTCAAGACAACTTCCTCTGTTAACACTTCTGTGGGAA
>read6_right_ambiguous/1
AGAGTTTCTTAGAGACGGTTGGGAAATTGTTAAATTTATCTCAACCTGTGCTTGTGAAATTGTCGGTGGACAAATTGTCACCTGTGCA
>read6_right_ambiguous/2
GCTTCAACAGCTTCACTAGTAGGTTGTTCTAATGGTTGTAAATCACCAGTTTTCAAGACAACTTCCTCTGTTAACACTTCTGTGGGAA

test_trimming_reads

I've attached an image above showing how these reads align to the reference genome and their relationship to the amplicon primers. Here's a description of each test case:

read1_full_amplicon_readthrough: the entire amplicon is read from end to end in both reads of the pair, fully overlapping along the entire read length. A LEFT and RIGHT primer should be trimmed from both reads. read2_full_amplicon_disjoint: the read pairs start at the ends of the amplicon, and do not overlap. LEFT should be trimmed from the first read, RIGHT should be trimmed from the second read. This is the typical arrangement for ligation based protocols. read3_left_readthrough: the read pairs start at the left end only, and perfectly overlap. LEFT should be trimmed from both. read4_right_readthrough: the read pairs start at the right end only, and perfectly overlap. RIGHT should be trimmed from both. read5_left_ambiguous: the first read in the pair starts in a LEFT primer and ends in the RIGHT pair for the previous amplicon. The second read in the pair does not overlap the first read and ends in a RIGHT primer. LEFT should be trimmed from the first read, RIGHT should be trimmed from the second read. read6_right_ambiguous: same situation as above, except the right hand side is ambiguous instead of the left. LEFT should be trimmed from the first read, RIGHT should be trimmed from the second read.

Here is how ivar 1.2.3 trims these reads, with my assessment of the correctness:

# RIGHT primer missed
read1_full_amplicon_readthrough/1 97      MN908947.3      2206    60      24S387M

# LEFT primer missed
read1_full_amplicon_readthrough/2 145     MN908947.3      2182    60      387M24S

# Correct
read2_full_amplicon_disjoint/1    97      MN908947.3      2206    60      24S155M

# Correct
read2_full_amplicon_disjoint/2    145     MN908947.3      2413    60      156M24S

# Correct
read3_left_readthrough/1  97      MN908947.3      2206    60      24S155M

# LEFT primer missed
read3_left_readthrough/2  145     MN908947.3      2182    60      179M

# Correct
read4_right_readthrough/1 81      MN908947.3      2413    60      156M24S

# RIGHT primer missed
read4_right_readthrough/2 161     MN908947.3      2413    60      180M

# Correct
read5_left_ambiguous/1    97      MN908947.3      2206    60      24S64M

# Correct
read5_left_ambiguous/2    145     MN908947.3      2505    60      64M24S

# Correct
read6_right_ambiguous/1   97      MN908947.3      2206    60      24S64M

# Correct
read6_right_ambiguous/2   145     MN908947.3      2505    60      64M24S

In my fork of ivar I have added a command line argument that forces ivar to treat both halves of the pair independently allowing it to trim both primers from reads. This fixes the missed primers for read1, read3 and read4. It overtrims read5 and read6 but this situation should be rare and overtrimming is preferable to undertrimming:

# Correct
read1_full_amplicon_readthrough/1 97      MN908947.3      2206    60      24S363M24S

# Correct
read1_full_amplicon_readthrough/2 145     MN908947.3      2206    60      24S363M24S

# Correct
read2_full_amplicon_disjoint/1    97      MN908947.3      2206    60      24S155M

# Correct
read2_full_amplicon_disjoint/2    145     MN908947.3      2413    60      156M24S

# Correct
read3_left_readthrough/1  97      MN908947.3      2206    60      24S155M

# Correct
read3_left_readthrough/2  145     MN908947.3      2206    60      24S155M

# Correct
read4_right_readthrough/1 81      MN908947.3      2413    60      156M24S

# Correct
read4_right_readthrough/2 161     MN908947.3      2413    60      156M24S

# Overtrimmed, both primers removed
read5_left_ambiguous/1    97      MN908947.3      2206    60      24S42M22S

# Overtrimmed, both primers removed
read5_left_ambiguous/2    145     MN908947.3      2529    60      24S40M24S

# Overtrimmed, both primers removed
read6_right_ambiguous/1   97      MN908947.3      2206    60      24S42M22S

# Overtrimmed, both primers removed
read6_right_ambiguous/2   145     MN908947.3      2529    60      24S40M24S
gkarthik commented 4 years ago

Hey! Thank you for sending over this test dataset. We are doing a check if the insert size is less than the read length and in those cases, we are excluding the reads. We should have it added within the next 2 days ( We are testing it now).

I went through the flag you added, wouldn't you end up over trimming for reads with insert sizes > read length as well?

jts commented 4 years ago

We are doing a check if the insert size is less than the read length and in those cases, we are excluding the reads.

Are you going to exclude them, or trim them without considering pairing? With nextera you have valid reads with a short insert if the adapters are inserted within 150bp from an amplicon end, so I you will lose a lot of coverage if you exclude

wouldn't you end up over trimming for reads with insert sizes > read length as well

I'm not sure I follow this case, can you clarify?

AlaaALatif commented 4 years ago

Hi Jared, We are modifying ivar trim so it identifies if insert size is less than read length, and trims them without considering pairing.

Using the modified code under this branch, we get the expected trimming behavior for all cases (see output below):

# Correct
read1_full_amplicon_readthrough/1   97  NC_045512.2 2206    60  24S363M24S  

# Correct
read1_full_amplicon_readthrough/2   145 NC_045512.2 2206    60  24S363M24S  

# Correct
read2_full_amplicon_disjoint/1  97  NC_045512.2 2206    60  24S155M 

# Correct
read2_full_amplicon_disjoint/2  145 NC_045512.2 2413    60  156M24S

# Correct
read3_left_readthrough/1    97  NC_045512.2 2206    60  24S155M 

# Correct
read3_left_readthrough/2    145 NC_045512.2 2206    60  24S155M 

# Correct
read4_right_readthrough/2   81  NC_045512.2 2413    60  156M24S

# Correct
read4_right_readthrough/2   161 NC_045512.2 2413    60  156M24S

# Correct
read5_left_ambiguous/1  97  NC_045512.2 2206    60  24S64M

# Correct
read5_left_ambiguous/2  145 NC_045512.2 2505    60  64M24S

# Correct
read6_right_ambiguous/1 97  NC_045512.2 2206    60  24S64M

# Correct
read6_right_ambiguous/2 145 NC_045512.2 2505    60  64M24S  

We will aim to add this to a new version of iVar as soon as possible, hopefully by tomorrow (testing in progress).

Thank you for sharing those test reads! They were very helpful.

Cheers, Al

jts commented 4 years ago

Thanks @AlaaALatif and @gkarthik!

jts commented 4 years ago

Heya,

I'm just testing this out and its a big improvement! I've found some reads that should be trimmed are missed though, here's an example:

>read9/1
TCTTTGCTCAGGATGGTAATGCTGCTATCAGCGATTATGACTACTATCGTTATAATCTACCAACAATGTGTGATATCAGACAACTACTATTTGTAGTTGAAGTTGTTGATAAGTACTTTGATTGTTACGATGGTGGCTGTATTAATG
>read9/2
GCATTAATACAGCCACCATCGTAACAATCAAAGTACTTATCAACAACTTCAACTACAAATAGTAGTTGTCTGATATCACACATTGTTGGTAGATTATAACGATAGTAGTCATAATCGCTGATAGCAGCATTACCATCCTGAGCAAAGA

ivar-1.2.4 trim result:

read9   97      MN908947.3      14762   60      147M    =       14762   148     TCTTTGCTCAGGATGGTAATGCTGCTATCAGCGATTATGACTACTATCGTTATAATCTACCAACAATGTGTGATATCAGACAACTACTATTTGTAGTTGAAGTTGTTGATAAGTACTTTGATTGTTACGATGGTGGCTGTATTAATG *       NM:i:0  MD:Z:147        AS:i:147        XS:i:0
read9   145     MN908947.3      14762   60      137M11S =       14762   -148    TCTTTGCTCAGGATGGTAATGCTGCTATCAGCGATTATGACTACTATCGTTATAATCTACCAACAATGTGTGATATCAGACAACTACTATTTGTAGTTGAAGTTGTTGATAAGTACTTTGATTGTTACGATGGTGGCTGTATTAATGC        *       NM:i:0  MD:Z:148        AS:i:148        XS:i:0  XA:i:115

These reads overlap along most of their length so the primer should be trimmed from both sequences, instead of just from read9/2

AlaaALatif commented 4 years ago

Heyo,

Thank you for following up and sharing the additional edge case. I've modified ivar trim even further under this branch so that primer trimming occurs properly for all test cases (see output on new test cases below).

read1_full_amplicon_readthrough 97  NC_045512.2 2206    60  24S363M24S  
read2_full_amplicon_disjoint    97  NC_045512.2 2206    60  24S155M 
read3_left_readthrough  97  NC_045512.2 2206    60  24S155M 
read5_left_ambiguous    97  NC_045512.2 2206    60  24S64M  
read6_right_ambiguous   97  NC_045512.2 2206    60  24S64M  
read1_full_amplicon_readthrough 145 NC_045512.2 2206    60  24S363M24S  
read3_left_readthrough  145 NC_045512.2 2206    60  24S155M 
read4_right_readthrough 161 NC_045512.2 2413    60  156M24S
read2_full_amplicon_disjoint    145 NC_045512.2 2413    60  156M24S
read4_right_readthrough 81  NC_045512.2 2413    60  156M24S
read5_left_ambiguous    145 NC_045512.2 2505    60  64M24S  
read6_right_ambiguous   145 NC_045512.2 2505    60  64M24S  
read9   97  NC_045512.2 14762   60  137M10S
read9   145 NC_045512.2 14762   60  137M11S 

However there still might be very specific cases where over trimming may sometime occur. Still, we may end up adding this to the next release because conservative behavior should be preferred over the alternative.

Cheers :) Al

jts commented 4 years ago

Thanks @AlaaALatif, I have run this on ~2500 samples and it now fixes all of the problematic sites I have identified so far, whereas 1.2.4 did not. There are a few new variants that pop up that I want to look at, but I think they are just churn around the allele frequency threshold of 0.25.

I have a second request but I will open a new issue for this as its unrelated to trimming.

jts commented 4 years ago

@AlaaALatif I have found a corner case that I don't understand where a read containing a deletion is incorrectly trimmed. Oddly it only happens when I provide quality strings, so perhaps is an interaction with the quality trimmer. Here is a test case:

@read10_deletion/1
ACGCAGAAGGGAGCAGAGGCGGCAGTCAAGCCTCTTCTCGTTCCTCATCACGTAGTCGCAACAGTTCAAGAAATTCAACTCCAGGCAGCAGTAGGGGAACTTCTCCTGCTAGCAATGGCGGTGATGCTGCGCTTGCTTTGCTGCTGCTT
+
AAAA33>>FFABGEE4GGFEG222EHCGG3A1A1BFHHF1FFGEFGFBEFHG1?F1D?EEEGHBGDF@FB33F4G@F@GH3FGC3??EA?BGD43/E?/?GF441BGHHHFHEHHH110/<A//?11?11/->CFC<<1>FGD1<0<0<
@read10_deletion/2
GTTTCTTCGAGGCCTCAGCAGCCGATTTCTTCGAGACAGCTTGGCCTTGTTGTTGTTGGCCTTTACCTGACATTTTGCTCTCATGCTGCTTCAATCTGTCAAGCAGCAGCAAAGCAAGAGCAGCAACA
+
B00G<F1E0HHGB001B0F00G0B112FB2211EF1GB11F@B121111B1@22F@B0011BGF11G1EHGB00GB0A11F1AB/01A/00?FA002F220B000000FBB11GF111113B3>>>11

The resulting bam file is:

99      MN908947.3      28788   60      111M9D5M33S     =       28918   258     ACGCAGAAGGGAGCAGAGGCGGCAGTCAAGCCTCTTCTCGTTCCTCATCACGTAGTCGCAACAGTTCAAGAAATTCAACTCCAGGCAGCAGTAGGGGAACTTCTCCTGCTAGCAATGGCGGTGATGCTGCGCTTGCTTTGCTGCTGCTT       AAAA33>>FFABGEE4GGFEG222EHCGG3A1A1BFHHF1FFGEFGFBEFHG1?F1D?EEEGHBGDF@FB33F4G@F@GH3FGC3??EA?BGD43/E?/?GF441BGHHHFHEHHH110/<A//?11?11/->CFC<<1>FGD1<0<0<       NM:i:10 MD:Z:111^GAATGGCTG19T18 AS:i:129        XS:i:0
147     MN908947.3      29034   60      116S8M4S        =       28788   -258    TGTTGCTGCTCTTGCTTTGCTGCTGCTTGACAGATTGAAGCAGCATGAGAGCAAAATGTCAGGTAAAGGCCAACAACAACAAGGCCAAGCTGTCTCGAAGAAATCGGCTGCTGAGGCCTCGAAGAAAC    11>>>3B311111FG11BBF000000B022F200AF?00/A10/BA1F11A0BG00BGHE1G11FGB1100B@F22@1B111121B@F11BG1FE1122BF211B0G00F0B100BGHH0E1F<G00B        NM:i:11 MD:Z:2A36C4T15T27A5A1T8T11T2T5G1    AS:i:79 XS:i:0  XA:i:211

In this case neither softclip that was applied to the reads makes sense.

Thanks again for all your work on this.

gkarthik commented 4 years ago

Hello Jared,

The last 33 bps of the first read are trimmed off because the quality drops below the default threshold of 20. In case of the reverse read, the quality drops below 20 after 12 bases from the 5' end of the reverse read. The last 4 bases are soft clipped because of an overlap with a primer region.

I think in both cases this is the expected behavior?

jts commented 4 years ago

Yes sorry, you're right, I get the same result with ivar-1.2.3. I see new low quality variant calls following this deletion, so I thought this read demonstrated the problem - I'll look into this further.

jts commented 4 years ago

Here's a fixed demonstration of the issue I tried to raise above:

@read10_deletion/1
CTGTAACTTTGAAGCAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGCTGTTTTTCATAGCGCTTCCAAAATCATAACCCTCAAAAAG
+
AAB3>FFFFFFFGGGGGGGFFGHBHHFFHHCHHHGFHFGHCHHGHH5F5AGGGHGCEGCEGEFHFGH?EFFGHGGGFGHFFGFFHHEEBGHHFEGEGEECGHHFHHFHHHHHHHHGHHFHGHGGGGGHFHHAFHHHHHHHHHHHGHHHG
@read10_deletion/2
CTTTTTGAGGGTTATGATTTTGGAAGCGCTATGAAAAACAGCAATAAGCCATCCGAAAGGGAGTGAGGCTTGTATCGGTATCGTTGCAGTAGCGCGAACAAAATCTGAAGGAGTAGCATCCTTGATTTCACCTTGCTTCAAAGTTACAG
+
DGFGG<<<1BF?F1FFCFGDHFHFFFGGFHGGF32B/0C2FEHHG3HFGCGE>//?DFFG3HGHEHDGEG>?BGFD1EGGECDBHFEEDGDEE?DFHHHHHFGFGBHFHHGFHHHFB?GGGGDHHFAGFGGGGG4GGABFA33BAA3>3

Bam prior to trimming:

read10_deletion 99      MN908947.3      25427   60      105M18D44M      =       25427   167     CTGTAACTTTGAAGCAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGCTGTTTTTCATAGCGCTTCCAAAATCATAACCCTCAAAAAG       AAB3>FFFFFFFGGGGGGGFFGHBHHFFHHCHHHGFHFGHCHHGHH5F5AGGGHGCEGCEGEFHFGH?EFFGHGGGFGHFFGFFHHEEBGHHFEGEGEECGHHFHHFHHHHHHHHGHHFHGHGGGGGHFHHAFHHHHHHHHHHHGHHHG   NM:i:19     MD:Z:105^TTGTTGGCGTTGCACTTC13G30        MC:Z:105M18D44M AS:i:120        XS:i:0
read10_deletion 147     MN908947.3      25427   60      105M18D44M      =       25427   -167    CTGTAACTTTGAAGCAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGCTGTTTTTCATAGCGCTTCCAAAATCATAACCCTCAAAAAG       3>3AAB33AFBAGG4GGGGGFGAFHHDGGGG?BFHHHFGHHFHBGFGFHHHHHFD?EEDGDEEFHBDCEGGE1DFGB?>GEGDHEHGH3GFFD?//>EGCGFH3GHHEF2C0/B23FGGHFGGFFFHFHDGFCFF1F?FB1<<<GGFGD   NM:i:19     MD:Z:105^TTGTTGGCGTTGCACTTC13G30        MC:Z:105M18D44M AS:i:120        XS:i:0

Bam trimmed by ivar-1.2.3:

read10_deletion 99      MN908947.3      25427   60      105M18D44M      =       25427   167     CTGTAACTTTGAAGCAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGCTGTTTTTCATAGCGCTTCCAAAATCATAACCCTCAAAAAG       AAB3>FFFFFFFGGGGGGGFFGHBHHFFHHCHHHGFHFGHCHHGHH5F5AGGGHGCEGCEGEFHFGH?EFFGHGGGFGHFFGFFHHEEBGHHFEGEGEECGHHFHHFHHHHHHHHGHHFHGHGGGGGHFHHAFHHHHHHHHHHHGHHHG   NM:i:19     MD:Z:105^TTGTTGGCGTTGCACTTC13G30        MC:Z:105M18D44M AS:i:120        XS:i:0
read10_deletion 147     MN908947.3      25560   60      115S34M =       25427   -167    CTGTAACTTTGAAGCAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGCTGTTTTTCATAGCGCTTCCAAAATCATAACCCTCAAAAAG       3>3AAB33AFBAGG4GGGGGFGAFHHDGGGG?BFHHHFGHHFHBGFGFHHHHHFD?EEDGDEEFHBDCEGGE1DFGB?>GEGDHEHGH3GFFD?//>EGCGFH3GHHEF2C0/B23FGGHFGGFFFHFHDGFCFF1F?FB1<<<GGFGD   NM:i:19 MD:Z:105^TTGTTGGCGTTGCACTTC13G30    MC:Z:105M18D44M AS:i:120        XS:i:0

Bam trimmed by insert_sizes branch:

read10_deletion 99      MN908947.3      25427   60      105M18D44M      =       25427   167     CTGTAACTTTGAAGCAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGCTGTTTTTCATAGCGCTTCCAAAATCATAACCCTCAAAAAG       AAB3>FFFFFFFGGGGGGGFFGHBHHFFHHCHHHGFHFGHCHHGHH5F5AGGGHGCEGCEGEFHFGH?EFFGHGGGFGHFFGFFHHEEBGHHFEGEGEECGHHFHHFHHHHHHHHGHHFHGHGGGGGHFHHAFHHHHHHHHHHHGHHHG   NM:i:19     MD:Z:105^TTGTTGGCGTTGCACTTC13G30        MC:Z:105M18D44M AS:i:120        XS:i:0
read10_deletion 147     MN908947.3      25427   60      115S34M =       25427   -167    CTGTAACTTTGAAGCAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGCTGTTTTTCATAGCGCTTCCAAAATCATAACCCTCAAAAAG       3>3AAB33AFBAGG4GGGGGFGAFHHDGGGG?BFHHHFGHHFHBGFGFHHHHHFD?EEDGDEEFHBDCEGGE1DFGB?>GEGDHEHGH3GFFD?//>EGCGFH3GHHEF2C0/B23FGGHFGGFFFHFHDGFCFF1F?FB1<<<GGFGD   NM:i:19 MD:Z:105^TTGTTGGCGTTGCACTTC13G30    MC:Z:105M18D44M AS:i:120        XS:i:0

Note that the same trimming has been applied (115S) but the start position of the read has not been updated in the new version of ivar (it is 25427 but should be 25560 like in ivar-1.2.3).

AlaaALatif commented 4 years ago

Hey, thanks for giving an example of the issue. I just pushed a potential fix to insert_size. Still need to do more tests but thought I would share it for any useful feedback you may have :)

Cheers, Al

jts commented 4 years ago

Thanks Al, that fixed the issue I had around indels.

jts commented 4 years ago

Hi @AlaaALatif and @gkarthik, is there a timeline for making a new release of ivar with this change and ideally #75? We'd like to update our pipelines soon prior to a new data release

AlaaALatif commented 4 years ago

Hi @jts, yes we're currently working on this. The plan is to make a release by end of this week. Hope that works! :)

jts commented 4 years ago

That would be great, thanks!

AlaaALatif commented 4 years ago

Hi again, just wanted to update that we're adding final tests for #75 before making a release asap.

Thanks

jts commented 4 years ago

Thanks @AlaaALatif! I had a look at the amplicon_search branch and noticed it is checking whether individual reads (not pairs) are within the amplicon:

https://github.com/andersen-lab/ivar/blob/amplicon_search/src/trim_primer_quality.cpp#L336

In my python prototype I check whether the paired-end fragment is within the amplicon instead, which will filter more artifacts out:

        fragment_start = alignment.reference_start
        fragment_end = 0
        tlen = alignment.template_length
        if tlen < 0:
            fragment_start = alignment.reference_end
        fragment_end = fragment_start + tlen

        start_intervals = tree.at(fragment_start)
        end_intervals = tree.at(fragment_end)
jts commented 4 years ago

Hi,

I'm testing the latest change to the branch - I think:

    fragment_coords.low = r->core.pos + bam_endpos(r) + r->core.isize;
    fragment_coords.high = r->core.pos + bam_endpos(r);

should be:

    fragment_coords.low = bam_endpos(r) + r->core.isize;
    fragment_coords.high = bam_endpos(r);

as bam_endpos already includes r->core.pos.

AlaaALatif commented 4 years ago

Yes I agree, I changed it accordingly. Thanks!

jts commented 4 years ago

Thanks. I have tested a merge between the insert_sizes and amplicon_search (with the endpos fix) on >2000 genomes and am happy with the results. It would be good to get the opinion of others (@m-bull?) but I feel these changes are ready for release.

Jared

gkarthik commented 4 years ago

Yikes, that was a big mistake on my part 😅. Thank you for catching it @jts! We are putting together a unit test for this bit and should have a release out by tomorrow.