Closed gkarthik closed 4 years ago
This would be a great feature - adding that we occaisionally see impossibly large insert sizes (> a single amplicon), perhaps from chimeras, so being able to set minimum and maximum insert sizes would be nice.
Adding to what @m-bull said, we also see reads that have what should be impossible alignments, like starting in sequence that is unique to an amplicon from pool1, and ending in sequence unique to the adjacent amplicon in pool2. It would be great if these could be filtered as well as they are likely aberrant amplicons.
Hi all,
Could I get an update on this issue?
Jared
Hi Jared,
I haven't had time to start on this issue yet, but planning to do so as soon as I can. Definitely by this week. Will keep you updated.
Cheers, Al
Hi all,
I've looked into this issue a little more and have developed a set of test reads to investigate the trimming behaviour for certain arrangements of pairs. Here are the reads I am using:
>read1_full_amplicon_readthrough/1
AGAGTTTCTTAGAGACGGTTGGGAAATTGTTAAATTTATCTCAACCTGTGCTTGTGAAATTGTCGGTGGACAAATTGTCACCTGTGCAAAGGAAATTAAGGAGAGTGTTCAGACATTCTTTAAGCTTGTAAATAAATTTTTGGCTTTGTGTGCTGACTCTATCATTATTGGTGGAGCTAAACTTAAAGCCTTGAATTTAGGTGAAACATTTGTCACGCACTCAAAGGGATTGTACAGAAAGTGTGTTAAATCCAGAGAAGAAACTGGCCTACTCATGCCTCTAAAAGCCCCAAAAGAAATTATCTTCTTAGAGGGAGAAACACTTCCCACAGAAGTGTTAACAGAGGAAGTTGTCTTGAAAACTGGTGATTTACAACCATTAGAACAACCTACTAGTGAAGCTGTTGAAGC
>read1_full_amplicon_readthrough/2
GCTTCAACAGCTTCACTAGTAGGTTGTTCTAATGGTTGTAAATCACCAGTTTTCAAGACAACTTCCTCTGTTAACACTTCTGTGGGAAGTGTTTCTCCCTCTAAGAAGATAATTTCTTTTGGGGCTTTTAGAGGCATGAGTAGGCCAGTTTCTTCTCTGGATTTAACACACTTTCTGTACAATCCCTTTGAGTGCGTGACAAATGTTTCACCTAAATTCAAGGCTTTAAGTTTAGCTCCACCAATAATGATAGAGTCAGCACACAAAGCCAAAAATTTATTTACAAGCTTAAAGAATGTCTGAACACTCTCCTTAATTTCCTTTGCACAGGTGACAATTTGTCCACCGACAATTTCACAAGCACAGGTTGAGATAAATTTAACAATTTCCCAACCGTCTCTAAGAAACTCT
>read2_full_amplicon_disjoint/1
AGAGTTTCTTAGAGACGGTTGGGAAATTGTTAAATTTATCTCAACCTGTGCTTGTGAAATTGTCGGTGGACAAATTGTCACCTGTGCAAAGGAAATTAAGGAGAGTGTTCAGACATTCTTTAAGCTTGTAAATAAATTTTTGGCTTTGTGTGCTGACTCTATCATTATTGGTGGAGCTA
>read2_full_amplicon_disjoint/2
GCTTCAACAGCTTCACTAGTAGGTTGTTCTAATGGTTGTAAATCACCAGTTTTCAAGACAACTTCCTCTGTTAACACTTCTGTGGGAAGTGTTTCTCCCTCTAAGAAGATAATTTCTTTTGGGGCTTTTAGAGGCATGAGTAGGCCAGTTTCTTCTCTGGATTTAACACACTTTCTGTAC
>read3_left_readthrough/1
AGAGTTTCTTAGAGACGGTTGGGAAATTGTTAAATTTATCTCAACCTGTGCTTGTGAAATTGTCGGTGGACAAATTGTCACCTGTGCAAAGGAAATTAAGGAGAGTGTTCAGACATTCTTTAAGCTTGTAAATAAATTTTTGGCTTTGTGTGCTGACTCTATCATTATTGGTGGAGCTA
>read3_left_readthrough/2
TAGCTCCACCAATAATGATAGAGTCAGCACACAAAGCCAAAAATTTATTTACAAGCTTAAAGAATGTCTGAACACTCTCCTTAATTTCCTTTGCACAGGTGACAATTTGTCCACCGACAATTTCACAAGCACAGGTTGAGATAAATTTAACAATTTCCCAACCGTCTCTAAGAAACTCT
>read4_right_readthrough/1
GCTTCAACAGCTTCACTAGTAGGTTGTTCTAATGGTTGTAAATCACCAGTTTTCAAGACAACTTCCTCTGTTAACACTTCTGTGGGAAGTGTTTCTCCCTCTAAGAAGATAATTTCTTTTGGGGCTTTTAGAGGCATGAGTAGGCCAGTTTCTTCTCTGGATTTAACACACTTTCTGTAC
>read4_right_readthrough/2
GTACAGAAAGTGTGTTAAATCCAGAGAAGAAACTGGCCTACTCATGCCTCTAAAAGCCCCAAAAGAAATTATCTTCTTAGAGGGAGAAACACTTCCCACAGAAGTGTTAACAGAGGAAGTTGTCTTGAAAACTGGTGATTTACAACCATTAGAACAACCTACTAGTGAAGCTGTTGAAGC
>read5_left_ambiguous/1
AGAGTTTCTTAGAGACGGTTGGGAAATTGTTAAATTTATCTCAACCTGTGCTTGTGAAATTGTCGGTGGACAAATTGTCACCTGTGCA
>read5_left_ambiguous/2
GCTTCAACAGCTTCACTAGTAGGTTGTTCTAATGGTTGTAAATCACCAGTTTTCAAGACAACTTCCTCTGTTAACACTTCTGTGGGAA
>read6_right_ambiguous/1
AGAGTTTCTTAGAGACGGTTGGGAAATTGTTAAATTTATCTCAACCTGTGCTTGTGAAATTGTCGGTGGACAAATTGTCACCTGTGCA
>read6_right_ambiguous/2
GCTTCAACAGCTTCACTAGTAGGTTGTTCTAATGGTTGTAAATCACCAGTTTTCAAGACAACTTCCTCTGTTAACACTTCTGTGGGAA
I've attached an image above showing how these reads align to the reference genome and their relationship to the amplicon primers. Here's a description of each test case:
read1_full_amplicon_readthrough
: the entire amplicon is read from end to end in both reads of the pair, fully overlapping along the entire read length. A LEFT and RIGHT primer should be trimmed from both reads.
read2_full_amplicon_disjoint
: the read pairs start at the ends of the amplicon, and do not overlap. LEFT should be trimmed from the first read, RIGHT should be trimmed from the second read. This is the typical arrangement for ligation based protocols.
read3_left_readthrough
: the read pairs start at the left end only, and perfectly overlap. LEFT should be trimmed from both.
read4_right_readthrough
: the read pairs start at the right end only, and perfectly overlap. RIGHT should be trimmed from both.
read5_left_ambiguous
: the first read in the pair starts in a LEFT primer and ends in the RIGHT pair for the previous amplicon. The second read in the pair does not overlap the first read and ends in a RIGHT primer. LEFT should be trimmed from the first read, RIGHT should be trimmed from the second read.
read6_right_ambiguous
: same situation as above, except the right hand side is ambiguous instead of the left. LEFT should be trimmed from the first read, RIGHT should be trimmed from the second read.
Here is how ivar 1.2.3 trims these reads, with my assessment of the correctness:
# RIGHT primer missed
read1_full_amplicon_readthrough/1 97 MN908947.3 2206 60 24S387M
# LEFT primer missed
read1_full_amplicon_readthrough/2 145 MN908947.3 2182 60 387M24S
# Correct
read2_full_amplicon_disjoint/1 97 MN908947.3 2206 60 24S155M
# Correct
read2_full_amplicon_disjoint/2 145 MN908947.3 2413 60 156M24S
# Correct
read3_left_readthrough/1 97 MN908947.3 2206 60 24S155M
# LEFT primer missed
read3_left_readthrough/2 145 MN908947.3 2182 60 179M
# Correct
read4_right_readthrough/1 81 MN908947.3 2413 60 156M24S
# RIGHT primer missed
read4_right_readthrough/2 161 MN908947.3 2413 60 180M
# Correct
read5_left_ambiguous/1 97 MN908947.3 2206 60 24S64M
# Correct
read5_left_ambiguous/2 145 MN908947.3 2505 60 64M24S
# Correct
read6_right_ambiguous/1 97 MN908947.3 2206 60 24S64M
# Correct
read6_right_ambiguous/2 145 MN908947.3 2505 60 64M24S
In my fork of ivar I have added a command line argument that forces ivar to treat both halves of the pair independently allowing it to trim both primers from reads. This fixes the missed primers for read1, read3 and read4. It overtrims read5 and read6 but this situation should be rare and overtrimming is preferable to undertrimming:
# Correct
read1_full_amplicon_readthrough/1 97 MN908947.3 2206 60 24S363M24S
# Correct
read1_full_amplicon_readthrough/2 145 MN908947.3 2206 60 24S363M24S
# Correct
read2_full_amplicon_disjoint/1 97 MN908947.3 2206 60 24S155M
# Correct
read2_full_amplicon_disjoint/2 145 MN908947.3 2413 60 156M24S
# Correct
read3_left_readthrough/1 97 MN908947.3 2206 60 24S155M
# Correct
read3_left_readthrough/2 145 MN908947.3 2206 60 24S155M
# Correct
read4_right_readthrough/1 81 MN908947.3 2413 60 156M24S
# Correct
read4_right_readthrough/2 161 MN908947.3 2413 60 156M24S
# Overtrimmed, both primers removed
read5_left_ambiguous/1 97 MN908947.3 2206 60 24S42M22S
# Overtrimmed, both primers removed
read5_left_ambiguous/2 145 MN908947.3 2529 60 24S40M24S
# Overtrimmed, both primers removed
read6_right_ambiguous/1 97 MN908947.3 2206 60 24S42M22S
# Overtrimmed, both primers removed
read6_right_ambiguous/2 145 MN908947.3 2529 60 24S40M24S
Hey! Thank you for sending over this test dataset. We are doing a check if the insert size is less than the read length and in those cases, we are excluding the reads. We should have it added within the next 2 days ( We are testing it now).
I went through the flag you added, wouldn't you end up over trimming for reads with insert sizes > read length as well?
We are doing a check if the insert size is less than the read length and in those cases, we are excluding the reads.
Are you going to exclude them, or trim them without considering pairing? With nextera you have valid reads with a short insert if the adapters are inserted within 150bp from an amplicon end, so I you will lose a lot of coverage if you exclude
wouldn't you end up over trimming for reads with insert sizes > read length as well
I'm not sure I follow this case, can you clarify?
Hi Jared,
We are modifying ivar trim
so it identifies if insert size is less than read length, and trims them without considering pairing.
Using the modified code under this branch, we get the expected trimming behavior for all cases (see output below):
# Correct
read1_full_amplicon_readthrough/1 97 NC_045512.2 2206 60 24S363M24S
# Correct
read1_full_amplicon_readthrough/2 145 NC_045512.2 2206 60 24S363M24S
# Correct
read2_full_amplicon_disjoint/1 97 NC_045512.2 2206 60 24S155M
# Correct
read2_full_amplicon_disjoint/2 145 NC_045512.2 2413 60 156M24S
# Correct
read3_left_readthrough/1 97 NC_045512.2 2206 60 24S155M
# Correct
read3_left_readthrough/2 145 NC_045512.2 2206 60 24S155M
# Correct
read4_right_readthrough/2 81 NC_045512.2 2413 60 156M24S
# Correct
read4_right_readthrough/2 161 NC_045512.2 2413 60 156M24S
# Correct
read5_left_ambiguous/1 97 NC_045512.2 2206 60 24S64M
# Correct
read5_left_ambiguous/2 145 NC_045512.2 2505 60 64M24S
# Correct
read6_right_ambiguous/1 97 NC_045512.2 2206 60 24S64M
# Correct
read6_right_ambiguous/2 145 NC_045512.2 2505 60 64M24S
We will aim to add this to a new version of iVar as soon as possible, hopefully by tomorrow (testing in progress).
Thank you for sharing those test reads! They were very helpful.
Cheers, Al
Thanks @AlaaALatif and @gkarthik!
Heya,
I'm just testing this out and its a big improvement! I've found some reads that should be trimmed are missed though, here's an example:
>read9/1
TCTTTGCTCAGGATGGTAATGCTGCTATCAGCGATTATGACTACTATCGTTATAATCTACCAACAATGTGTGATATCAGACAACTACTATTTGTAGTTGAAGTTGTTGATAAGTACTTTGATTGTTACGATGGTGGCTGTATTAATG
>read9/2
GCATTAATACAGCCACCATCGTAACAATCAAAGTACTTATCAACAACTTCAACTACAAATAGTAGTTGTCTGATATCACACATTGTTGGTAGATTATAACGATAGTAGTCATAATCGCTGATAGCAGCATTACCATCCTGAGCAAAGA
ivar-1.2.4 trim
result:
read9 97 MN908947.3 14762 60 147M = 14762 148 TCTTTGCTCAGGATGGTAATGCTGCTATCAGCGATTATGACTACTATCGTTATAATCTACCAACAATGTGTGATATCAGACAACTACTATTTGTAGTTGAAGTTGTTGATAAGTACTTTGATTGTTACGATGGTGGCTGTATTAATG * NM:i:0 MD:Z:147 AS:i:147 XS:i:0
read9 145 MN908947.3 14762 60 137M11S = 14762 -148 TCTTTGCTCAGGATGGTAATGCTGCTATCAGCGATTATGACTACTATCGTTATAATCTACCAACAATGTGTGATATCAGACAACTACTATTTGTAGTTGAAGTTGTTGATAAGTACTTTGATTGTTACGATGGTGGCTGTATTAATGC * NM:i:0 MD:Z:148 AS:i:148 XS:i:0 XA:i:115
These reads overlap along most of their length so the primer should be trimmed from both sequences, instead of just from read9/2
Heyo,
Thank you for following up and sharing the additional edge case. I've modified ivar trim
even further under this branch so that primer trimming occurs properly for all test cases (see output on new test cases below).
read1_full_amplicon_readthrough 97 NC_045512.2 2206 60 24S363M24S
read2_full_amplicon_disjoint 97 NC_045512.2 2206 60 24S155M
read3_left_readthrough 97 NC_045512.2 2206 60 24S155M
read5_left_ambiguous 97 NC_045512.2 2206 60 24S64M
read6_right_ambiguous 97 NC_045512.2 2206 60 24S64M
read1_full_amplicon_readthrough 145 NC_045512.2 2206 60 24S363M24S
read3_left_readthrough 145 NC_045512.2 2206 60 24S155M
read4_right_readthrough 161 NC_045512.2 2413 60 156M24S
read2_full_amplicon_disjoint 145 NC_045512.2 2413 60 156M24S
read4_right_readthrough 81 NC_045512.2 2413 60 156M24S
read5_left_ambiguous 145 NC_045512.2 2505 60 64M24S
read6_right_ambiguous 145 NC_045512.2 2505 60 64M24S
read9 97 NC_045512.2 14762 60 137M10S
read9 145 NC_045512.2 14762 60 137M11S
However there still might be very specific cases where over trimming may sometime occur. Still, we may end up adding this to the next release because conservative behavior should be preferred over the alternative.
Cheers :) Al
Thanks @AlaaALatif, I have run this on ~2500 samples and it now fixes all of the problematic sites I have identified so far, whereas 1.2.4 did not. There are a few new variants that pop up that I want to look at, but I think they are just churn around the allele frequency threshold of 0.25.
I have a second request but I will open a new issue for this as its unrelated to trimming.
@AlaaALatif I have found a corner case that I don't understand where a read containing a deletion is incorrectly trimmed. Oddly it only happens when I provide quality strings, so perhaps is an interaction with the quality trimmer. Here is a test case:
@read10_deletion/1
ACGCAGAAGGGAGCAGAGGCGGCAGTCAAGCCTCTTCTCGTTCCTCATCACGTAGTCGCAACAGTTCAAGAAATTCAACTCCAGGCAGCAGTAGGGGAACTTCTCCTGCTAGCAATGGCGGTGATGCTGCGCTTGCTTTGCTGCTGCTT
+
AAAA33>>FFABGEE4GGFEG222EHCGG3A1A1BFHHF1FFGEFGFBEFHG1?F1D?EEEGHBGDF@FB33F4G@F@GH3FGC3??EA?BGD43/E?/?GF441BGHHHFHEHHH110/<A//?11?11/->CFC<<1>FGD1<0<0<
@read10_deletion/2
GTTTCTTCGAGGCCTCAGCAGCCGATTTCTTCGAGACAGCTTGGCCTTGTTGTTGTTGGCCTTTACCTGACATTTTGCTCTCATGCTGCTTCAATCTGTCAAGCAGCAGCAAAGCAAGAGCAGCAACA
+
B00G<F1E0HHGB001B0F00G0B112FB2211EF1GB11F@B121111B1@22F@B0011BGF11G1EHGB00GB0A11F1AB/01A/00?FA002F220B000000FBB11GF111113B3>>>11
The resulting bam file is:
99 MN908947.3 28788 60 111M9D5M33S = 28918 258 ACGCAGAAGGGAGCAGAGGCGGCAGTCAAGCCTCTTCTCGTTCCTCATCACGTAGTCGCAACAGTTCAAGAAATTCAACTCCAGGCAGCAGTAGGGGAACTTCTCCTGCTAGCAATGGCGGTGATGCTGCGCTTGCTTTGCTGCTGCTT AAAA33>>FFABGEE4GGFEG222EHCGG3A1A1BFHHF1FFGEFGFBEFHG1?F1D?EEEGHBGDF@FB33F4G@F@GH3FGC3??EA?BGD43/E?/?GF441BGHHHFHEHHH110/<A//?11?11/->CFC<<1>FGD1<0<0< NM:i:10 MD:Z:111^GAATGGCTG19T18 AS:i:129 XS:i:0
147 MN908947.3 29034 60 116S8M4S = 28788 -258 TGTTGCTGCTCTTGCTTTGCTGCTGCTTGACAGATTGAAGCAGCATGAGAGCAAAATGTCAGGTAAAGGCCAACAACAACAAGGCCAAGCTGTCTCGAAGAAATCGGCTGCTGAGGCCTCGAAGAAAC 11>>>3B311111FG11BBF000000B022F200AF?00/A10/BA1F11A0BG00BGHE1G11FGB1100B@F22@1B111121B@F11BG1FE1122BF211B0G00F0B100BGHH0E1F<G00B NM:i:11 MD:Z:2A36C4T15T27A5A1T8T11T2T5G1 AS:i:79 XS:i:0 XA:i:211
In this case neither softclip that was applied to the reads makes sense.
Thanks again for all your work on this.
Hello Jared,
The last 33 bps of the first read are trimmed off because the quality drops below the default threshold of 20. In case of the reverse read, the quality drops below 20 after 12 bases from the 5' end of the reverse read. The last 4 bases are soft clipped because of an overlap with a primer region.
I think in both cases this is the expected behavior?
Yes sorry, you're right, I get the same result with ivar-1.2.3. I see new low quality variant calls following this deletion, so I thought this read demonstrated the problem - I'll look into this further.
Here's a fixed demonstration of the issue I tried to raise above:
@read10_deletion/1
CTGTAACTTTGAAGCAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGCTGTTTTTCATAGCGCTTCCAAAATCATAACCCTCAAAAAG
+
AAB3>FFFFFFFGGGGGGGFFGHBHHFFHHCHHHGFHFGHCHHGHH5F5AGGGHGCEGCEGEFHFGH?EFFGHGGGFGHFFGFFHHEEBGHHFEGEGEECGHHFHHFHHHHHHHHGHHFHGHGGGGGHFHHAFHHHHHHHHHHHGHHHG
@read10_deletion/2
CTTTTTGAGGGTTATGATTTTGGAAGCGCTATGAAAAACAGCAATAAGCCATCCGAAAGGGAGTGAGGCTTGTATCGGTATCGTTGCAGTAGCGCGAACAAAATCTGAAGGAGTAGCATCCTTGATTTCACCTTGCTTCAAAGTTACAG
+
DGFGG<<<1BF?F1FFCFGDHFHFFFGGFHGGF32B/0C2FEHHG3HFGCGE>//?DFFG3HGHEHDGEG>?BGFD1EGGECDBHFEEDGDEE?DFHHHHHFGFGBHFHHGFHHHFB?GGGGDHHFAGFGGGGG4GGABFA33BAA3>3
Bam prior to trimming:
read10_deletion 99 MN908947.3 25427 60 105M18D44M = 25427 167 CTGTAACTTTGAAGCAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGCTGTTTTTCATAGCGCTTCCAAAATCATAACCCTCAAAAAG AAB3>FFFFFFFGGGGGGGFFGHBHHFFHHCHHHGFHFGHCHHGHH5F5AGGGHGCEGCEGEFHFGH?EFFGHGGGFGHFFGFFHHEEBGHHFEGEGEECGHHFHHFHHHHHHHHGHHFHGHGGGGGHFHHAFHHHHHHHHHHHGHHHG NM:i:19 MD:Z:105^TTGTTGGCGTTGCACTTC13G30 MC:Z:105M18D44M AS:i:120 XS:i:0
read10_deletion 147 MN908947.3 25427 60 105M18D44M = 25427 -167 CTGTAACTTTGAAGCAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGCTGTTTTTCATAGCGCTTCCAAAATCATAACCCTCAAAAAG 3>3AAB33AFBAGG4GGGGGFGAFHHDGGGG?BFHHHFGHHFHBGFGFHHHHHFD?EEDGDEEFHBDCEGGE1DFGB?>GEGDHEHGH3GFFD?//>EGCGFH3GHHEF2C0/B23FGGHFGGFFFHFHDGFCFF1F?FB1<<<GGFGD NM:i:19 MD:Z:105^TTGTTGGCGTTGCACTTC13G30 MC:Z:105M18D44M AS:i:120 XS:i:0
Bam trimmed by ivar-1.2.3:
read10_deletion 99 MN908947.3 25427 60 105M18D44M = 25427 167 CTGTAACTTTGAAGCAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGCTGTTTTTCATAGCGCTTCCAAAATCATAACCCTCAAAAAG AAB3>FFFFFFFGGGGGGGFFGHBHHFFHHCHHHGFHFGHCHHGHH5F5AGGGHGCEGCEGEFHFGH?EFFGHGGGFGHFFGFFHHEEBGHHFEGEGEECGHHFHHFHHHHHHHHGHHFHGHGGGGGHFHHAFHHHHHHHHHHHGHHHG NM:i:19 MD:Z:105^TTGTTGGCGTTGCACTTC13G30 MC:Z:105M18D44M AS:i:120 XS:i:0
read10_deletion 147 MN908947.3 25560 60 115S34M = 25427 -167 CTGTAACTTTGAAGCAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGCTGTTTTTCATAGCGCTTCCAAAATCATAACCCTCAAAAAG 3>3AAB33AFBAGG4GGGGGFGAFHHDGGGG?BFHHHFGHHFHBGFGFHHHHHFD?EEDGDEEFHBDCEGGE1DFGB?>GEGDHEHGH3GFFD?//>EGCGFH3GHHEF2C0/B23FGGHFGGFFFHFHDGFCFF1F?FB1<<<GGFGD NM:i:19 MD:Z:105^TTGTTGGCGTTGCACTTC13G30 MC:Z:105M18D44M AS:i:120 XS:i:0
Bam trimmed by insert_sizes
branch:
read10_deletion 99 MN908947.3 25427 60 105M18D44M = 25427 167 CTGTAACTTTGAAGCAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGCTGTTTTTCATAGCGCTTCCAAAATCATAACCCTCAAAAAG AAB3>FFFFFFFGGGGGGGFFGHBHHFFHHCHHHGFHFGHCHHGHH5F5AGGGHGCEGCEGEFHFGH?EFFGHGGGFGHFFGFFHHEEBGHHFEGEGEECGHHFHHFHHHHHHHHGHHFHGHGGGGGHFHHAFHHHHHHHHHHHGHHHG NM:i:19 MD:Z:105^TTGTTGGCGTTGCACTTC13G30 MC:Z:105M18D44M AS:i:120 XS:i:0
read10_deletion 147 MN908947.3 25427 60 115S34M = 25427 -167 CTGTAACTTTGAAGCAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGCTGTTTTTCATAGCGCTTCCAAAATCATAACCCTCAAAAAG 3>3AAB33AFBAGG4GGGGGFGAFHHDGGGG?BFHHHFGHHFHBGFGFHHHHHFD?EEDGDEEFHBDCEGGE1DFGB?>GEGDHEHGH3GFFD?//>EGCGFH3GHHEF2C0/B23FGGHFGGFFFHFHDGFCFF1F?FB1<<<GGFGD NM:i:19 MD:Z:105^TTGTTGGCGTTGCACTTC13G30 MC:Z:105M18D44M AS:i:120 XS:i:0
Note that the same trimming has been applied (115S
) but the start position of the read has not been updated in the new version of ivar (it is 25427
but should be 25560
like in ivar-1.2.3).
Hey, thanks for giving an example of the issue. I just pushed a potential fix to insert_size
. Still need to do more tests but thought I would share it for any useful feedback you may have :)
Cheers, Al
Thanks Al, that fixed the issue I had around indels.
Hi @AlaaALatif and @gkarthik, is there a timeline for making a new release of ivar with this change and ideally #75? We'd like to update our pipelines soon prior to a new data release
Hi @jts, yes we're currently working on this. The plan is to make a release by end of this week. Hope that works! :)
That would be great, thanks!
Hi again, just wanted to update that we're adding final tests for #75 before making a release asap.
Thanks
Thanks @AlaaALatif! I had a look at the amplicon_search branch and noticed it is checking whether individual reads (not pairs) are within the amplicon:
https://github.com/andersen-lab/ivar/blob/amplicon_search/src/trim_primer_quality.cpp#L336
In my python prototype I check whether the paired-end fragment is within the amplicon instead, which will filter more artifacts out:
fragment_start = alignment.reference_start
fragment_end = 0
tlen = alignment.template_length
if tlen < 0:
fragment_start = alignment.reference_end
fragment_end = fragment_start + tlen
start_intervals = tree.at(fragment_start)
end_intervals = tree.at(fragment_end)
Hi,
I'm testing the latest change to the branch - I think:
fragment_coords.low = r->core.pos + bam_endpos(r) + r->core.isize;
fragment_coords.high = r->core.pos + bam_endpos(r);
should be:
fragment_coords.low = bam_endpos(r) + r->core.isize;
fragment_coords.high = bam_endpos(r);
as bam_endpos
already includes r->core.pos
.
Yes I agree, I changed it accordingly. Thanks!
Thanks. I have tested a merge between the insert_sizes
and amplicon_search
(with the endpos fix) on >2000 genomes and am happy with the results. It would be good to get the opinion of others (@m-bull?) but I feel these changes are ready for release.
Jared
Yikes, that was a big mistake on my part 😅. Thank you for catching it @jts! We are putting together a unit test for this bit and should have a release out by tomorrow.
Problem: iVar by default trims forward primers form
+
strand and reverse from-
strand. However, if the insert size is smaller than the read length, it should also trim reverse primers from the+
strand and vice versa. This should not affect the consensus since the ~90% of insert sizes >= 200. However, it might cause issues if the amplicon size is very small or if there are issues during library perp and during downstream variant calling.More details in #72 raised by @jts.
Potential fix: