fulcrumgenomics / fgprimer

APIs and wrappers for performing PCR primer design related tasks
MIT License

Hit.mismatches doesn't account for indel length #5

Open nh13 opened 4 months ago

nh13 commented 4 months ago

I think we count a multi-base indel only once, rather than once per inserted or deleted base, when subtracting the indels from the edits to obtain mismatches: https://github.com/fulcrumgenomics/fgprimer/blob/6cf2542e927ced37dd0dce4c335de8dff07789c7/src/main/scala/com/fulcrumgenomics/primerdesign/offtarget/BwaAlnInteractive.scala#L94C7-L94C12

@tfenne thoughts?

nh13 commented 4 months ago

Check out these test cases:

  1. The first query (q1) matches the reference exactly, so its NM tag is zero.
  2. The second query (q2) has one 2 bp deletion and a single mismatch, so its NM tag is three. That sets the Hit's edits to three, and the mismatches method would then return two instead of the expected one.

test.fq
```
@q1
AGTGATGCTAAGGGTCAAATAAGTCACCAGCAAATACACAGCACACATCTCATGATGTGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@q2
AGTGATGCTAAGGGTCAAATAAGgCACCAGCAAATACACCACACATCTCATGATGTGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
```
test.sam
```
@HD VN:1.5 SO:unsorted GO:query
@SQ SN:chr1 LN:10001
@PG ID:bwa PN:bwa VN:0.7.17-r1198-dirty CL:bwa aln -S tests/offtarget/data/miniref.fa test.fq
q1 0 chr1 1081 37 60M * 0 0 AGTGATGCTAAGGGTCAAATAAGTCACCAGCAAATACACAGCACACATCTCATGATGTGC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:60 HN:i:1
q2 0 chr1 1081 37 39M2D19M * 0 0 AGTGATGCTAAGGGTCAAATAAGGCACCAGCAAATACACCACACATCTCATGATGTGC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:2 MD:Z:23T15^AG19 HN:i:1
```
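To make the arithmetic for the q2 record concrete, here is a minimal Python sketch, not the fgprimer code itself. It assumes `mismatches` is meant to be NM minus the number of inserted or deleted bases, and the function names (`indel_events_and_bases`, `mismatches_by_event`, `mismatches_by_length`) are hypothetical, made up for illustration. It compares subtracting one per indel event with subtracting the indel length taken from the CIGAR.

```python
import re
from typing import Tuple

# One CIGAR operation: a length followed by an operator character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_events_and_bases(cigar: str) -> Tuple[int, int]:
    """Return (number of I/D operations, total bases in I/D operations)."""
    events = 0
    bases = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op in ("I", "D"):
            events += 1
            bases += int(length)
    return events, bases


def mismatches_by_event(nm: int, cigar: str) -> int:
    """Suspected current behaviour: each indel is subtracted only once."""
    events, _ = indel_events_and_bases(cigar)
    return nm - events


def mismatches_by_length(nm: int, cigar: str) -> int:
    """Assumed intended behaviour: every indel base is subtracted."""
    _, bases = indel_events_and_bases(cigar)
    return nm - bases


# Values taken from the q2 record above: CIGAR 39M2D19M, NM:i:3.
print(mismatches_by_event(3, "39M2D19M"))   # 2
print(mismatches_by_length(3, "39M2D19M"))  # 1
```

The length-based result agrees with the `XM:i:1` tag that bwa reports for q2, and both functions return zero for q1 (`60M`, `NM:i:0`).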
miniref.fa ``` >chr1 CAGGTGGATCATGAGGTCAGGAGTTCAAGACCAGCCTGGCCAACATGGTGAAGCCCCACC TCTACTAAAAATACAAAAAATTAGCTGGGCATGATGGCATGCACCTGTAATCCCGCTACT TGTGAGGCTGAAGCAGGAGAATTGCTTGAACCCAGAAGGTGGAGGTTGCAGTGAGCCGAG ATTGCGCCATTGCACTCTAGCCAGGGAGACAAAGCAAGACTCCATCTTGAAAAAAAATAA TTAAGCTAGCAGACTGGGCAGGTGGCTCACGCCTATAATCCCAGCACTTTGGGAGGCCGA GGTGGGTGGATCACCTGAAGTCAGGAGTTTGAGACCAGCCTGGCCAACATGGTGAAATAC CCCATCTCTACTAAAAGTACAAAAATTAGCTAGGCATGGTGGCTCATGCCTGTAGTCCCA GCTAATTGGGAGGCTGAGGCACGAGAATCGCTTGAACCTGAGAGGTGGAGGTTGCAGTGA GCCCAGATCACGACACTGCACTCTAGCCTGGGCAACAGCATGAGACTTGGTCTCAAAAAA AATAAGTCAATAAGACAAAAAAAAAAATTCAGCTAGCAATCTTGAATTTTACTGATTTAC TTGAATATATAATTAACTTTAAATTTATATGCTTGTTCATGTACATGTAAGGAAAATACA TTCCATAAAAAATCACAGATATAGTACTTAGATCTTTAAACTGTAGGGTCTAGGAAAATA ATAAATGATCGCCTCTTTTTTTGTTGTTGTTGTTGAGATGGAGTCTCTCTCTGTTGCCCA GGCTAGAGTGCAGTGGTGCGATCTCGGCTCACTGCAACCTCTGCCTCCTGAGTTCAAGCA ATTCTCCTGCTTCAGCCTCCTGAGTAGCTGGTATTACAGGTGCCCACCACCTCTCCTGGC TAATTTTTGTATTTTTAATAGAGACGGAGTTTTACCATCTTGGCCAGGCTGGTCTTGAAC TCCTGACCTCGTGATCCACCCGCCTTGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGC TACCGTGCCTGGCTGATTGCCTCTTTTTTAGTAGTAGAAGTGAGGTAAATGCCTGTTGGC AGTGATGCTAAGGGTCAAATAAGTCACCAGCAAATACACAGCACACATCTCATGATGTGC TCCAGCTGGCATCTCATTTGAGGGCAGAAAATCACTCCCTTTTGTCTAAAAACTAGTGTT CAGGAACTGATCCTGCAGCTCCCACCCGGGCTCTGGCATCACTCTCTCCCAGCATTTGCC AAACCGCAGTGAGAAGAAAGGCAGCTTGTCTTTGCACAAAGAAGCAAGTTTACTTGGGTT TTTTGAGATAGGGTCTTGCTCTGTCGCTCAGGCTGTAGTGCAGTGGTGCGATCATGGCTC ACTGCAGCCTTCACTTCCTGGGCTCAGGCGATCCTCCCACCTCAGCCTTCCAAGTAGCTG GGACAACAGGTGCATACCACCACACTTGGCCAATTTTTCAATTTTTTTGTAGAGACGGTG TCTTGCTGTGTTGTCCAGGCTGGTCTCAAACTCCTGGCCACATGCAATCCTCCTGCCTCT GCCTTGCTTGGGTTTTAATATTTGGAGCTAACCTGGGATTTGGGAGTTCCTGGTGTACCC CCACAGCAGCCAAGGACACTGAAGGTGCGTGCTTCAGAAATGGAGAAGCCCCATGTTAAT GCCCACAGATATTGACTAACTTGTGCAAGTCCTCGCCTTTAGTCCTGTTATGAACCCAAG AAGGTGGTGGTGTTACTAACTTGTCCCAGATTACTGTGTACACTGGGATATATTTATTTA ATTTAATTAGTTTATTTATTTATTTATTTATTTATTTATTTATTTTTGAGACAGAGTCTC CCTTTGTTGCCCAGTCTGGAGTGCAGTGGCACGGATCTTGGCTCACTGCAACTCCGTCTC 
CCGGGTTCAAGCAATTCTCCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCACACGC CCCCACACCCCGTTAATTTTATATTTTTAGTACAGACAGGGTTTCACCACGTTGGCCAGG CTGGTCTCGAACTCCTGAGCTCAGGTGACCTGCCTGCCTCGGCCTCCTAAAGTGCTGGGA TTACAGGCATGAGCCACCACATCTGGCCCTACACTGGGATTTAAAGGGATCCCTTCTTGC CTTCAACCCACATTGCCTTGAGATTAGAAGTGGCTTTGAGATTGAAGGTTAATATAAACA TCTGAAGCTTAGATAAATGTACGTTTGTGTGGACTCCTTTGAGATCCTACCTTCAGGTGT ATATGCTCACACATTTGTAATAGCACGTGCATCAGGCTGTTCCCTCCTACTCGAATGTCT TATTTTCTATTTAACATAATCTAGTAGATGAAAAAGCATGGCTTGACCTGGGTAACAGCC TTATGAGGAATATGGCCTTTTGGACTGTTGGACTGTTGAGGTTCTAGTAAGTGTGGACCT GGCAGAAAGTGACCAGAACTTATGCTAATTTATTCATTTTATTTTATTTTTATTTTTTGA GATAGGATCTCACTCTGTTGCCCAGGCTGGAGTGTAGTGGCACCATCTTGGCTCATTGCA ACCTCCTCCTTCTAGGCTCAAGCGATCCTCCCACCTCAGCCTCCCCAGCAGCCAGGACTG CAGGTGCACACCACCATGCCCAGCTAATTTTTATTTTATTTTTTGGTAGAGATGCGGTTT CACCATGTTGGCCAGGCTGGTCTTGAACTCCTGGCCTCAAGTGATCTGCCTGCCTCGGCT TCCCAAAGTGCTGGAATTACAGGCCTGAGCCACTGGACCTGGCCCAGAACTTATGCTAAT TCAAAGTAAATTTTGGTATTTAAAGAGGCCAGCCTTGTAACTGCAAATCTGTGAAGTGAC AATGTTGCAACATGGTAGTGAGGTAGGAGGCAGGGCTCAACTCCAGAAGCCAGAAGCGGG GCTTGGGACAGCGGACCAAACTGAGGACTAACTAAAACAGGGATAGGATGGAAGCAGCTT TTCATAAAACACATAAAACAGTGTGCCATATCAGTTTACCATTGCCATGGCAACACCTGG AGTTAGCACCCCTTTCCATGGCAATGACCAGAGGACCCAAAAGTTACTACCCCTTCCCTA GAAATGTCTGCATAAACCACCCGTTAGCCGGGCATGGTGGCTCACGCCTGTAATCCCAGC ACTTTGGGAGGCTGAGGTGGGTGGATCACCTGAGGTCAGGAGTTCGAGACCAGCCTGGCC AACATGGTGAAACCCATCTCTACCAAAAATACAAAAATTAGCTGGGCGTGGTGGTGGGCA CCTGTAGTCTCAGCTACTCAGGAGACTGAGGCAGGAGAATCATTTGAATCCGGAGGCAGA GGTTGCAGTGAGCCAAGATCGTGCTGCCATTGCACTCCAGCCTGGGTGACAAGAGCAAAA CTCTGTCTGAATAAACAAACAAACAAACAAACAAAAACAAAAAAACCACCCCTTACTCTG CATGTAACTAGAAGTGGGTATAAATATGACTACAAAACTGCCCTGAGCTGCTACTTTCTG CCTATGGGGTAGCTCTTTTCTGCGGGAGCAGTCACAGAGCTGTGACACTGCTTCTTCAAT AAAGCTGTTTTCTTCTCCCTCTGGCTTGCCCTTGAATTCTTTCCTGGGCAAAGCCAAGAA CCTCTGCAGGCTAATCCCCGCTCTGGGGCTCACCTGCCCTACATGAGTAGTGCAAATTGT AAATTTGCTAACAACAAATTGCCTCACATATTTTATTTTATTTATTTATTTTTTGAGATG GATTCTTGCTCTGTCACCTAGGCTGGAGTGCAGTAGCGAGATTTCAGCTTGCTGCAACCT 
CCACCTCCCGGGTTCAAGAGATTCTCCTGCCTAAGCCTCCCGAGTAGCTGGGATTACAGG CACCCCCCACCACGCCTGACTAATTCTTGTATTTTTAGTAGAGATGGGGTTTTGCCATGT TGGCCAGGCTGGTCTCGAACTCCTGACCTCAGGTGATCTGCCCTCCTCAGCCTCCCAAAG TGTTAGGATTACAGGTGTGAACTACCACGCCTGGCCTGCCTCACAATTTTTTTTTTTTTT TTTTTTTTTTTAGATGGAGTTTTGCTCTTGTTGCCCAGGCTGGAGTGCAATGGCGGGATC TCGGCTCACCGCAACTTCCGTCTCCCCGGTTCAAACAATTCTCCTGCTTCAGCCTCCTGA GTAGCTGGGATTGCAGGCATGCCCCACCACGCCCAGCTAATTTTGTATTTTTAGTAGAGA CGGGGTTTCTCCATATTGGTCAGGCTGGTCTCCAACTCCCGACCTCAGGTGATCGCCCAC CTCTGCCTCCCAAAGTGCTGGGATTAAGGCATGAGCCACTGCGCCCAGCCAGATTGATGG ATTGATTGATTTTGAGATGGAGTTTCCCTCTTGTTGCCCAGGCTGGAGTGCAATGGTGCA ATCTCAACTCACCTCAACCTCTGCCTCCCAGGTTCAAGCGACTCTCCTGCCTCAGCCTCT GGAGTAGCTGGGATTACAGGCATGCGCCACCATGCCCGGCTAATTTTGTATTTTTAGTAG AGACGGGGTTTCTCCATATTGGTCAGGCTGGTCGCGTTGGTCTGCCCGCCTCGGCCTCCC GAAGTGCTGGGATTACAGGCATGAGCAACCGTGCCCGGCGCCTCACAGATTTTAAAAGCG TAACTCTAAACTCATTGTTAGTCTAAAGTTATTGGGTTTTGATTTGCTTACATAATAGGG TTTAAGGAAAGTCAGCAGTAAGTTTGGCTTGGTCATATTAATAATAGGAAATGAGCCTGA GTAACATGGTGAAACTTCATCTCTACCAAAGAAAAATTCAAAAATTAGCCAGGTGTGGTG GCACATGCCTGTAGCCCCAGCTACTTGGGAGGCTGAGGTGGGAAGATCTCTCGAGCCTGG GAAGCAAAGGCTGCAGTGAGCCGAGATTGCACCACTGCAGTCCAGCCTGGGCAACAGAAT GAGACCCTGTCTCAAAAAAATAATAATAATAGGAAGTGATTTTAAGGTTTTGGTCTCAAT ACTTAAATATTTAAATATTGTTGAAAACCAGTAAAGCCTGGATCATATTGCATCTCAAAC TAAAAACTGGAGTTCTAGATTTAAACACACACACACAAGCTTTTTTATTTGCAGCTGAGA CTACAGGCATGTACCACTATGCCCAGCTGTTTTTTTGAGATATTTGTTTGTTTGTTCATT TGTTTGTTTTTGAGATGGAGTTTTACTCCGTCCCCAGGCTACAGTGCAGTGGCTCGACCT CAGCTCACTGCAACCTCCGCCTCCTGGGTTCAACTGATTCTCCTGCCTCAGCCCGACCCA GGTGTGAGCCACCATGCCCAGCCTCAGCTGTTTTTATTTTTTTATAGAAATGGGGTCTTG CTATGTTGTCCAGGCTGGTCTTGAACTCCTGGGCTCAAGTGATCCTCCCACCTTGGCCTC CAGAAGTGTTGGGATTACAGGTATGAGCCACTGTGCCTGGCTACAAAAATTTTTTCTTAG GATGAGGACATTTATACTATTGTTTTATTTTTGGTTGTTGTTGTTTGGGTTTTTTTTTTT TTGAGACGGAGTCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTCGCACGATCTCGGCTCAC TGCAGCCTCCACCTCCTGGGTTCAAGCAATTCTCCTGCCACAGTCTCCCGAGTAGCTGGG ATTACAGGGGTGCACCACCATGCCCAGCTAATTTTTGTTTTTCAGTAGAGATGGGGTTTT 
GCCATGTTGCCCAGGCTGGTTTTGAACTCCTGACCTCAGGTGATCCACCTGCCTCTGCCT CCCAAAGTGCTGGGATTACAGGCATGAGCCACCACGGCTGGCAGAATCACGTCTCATCTC TAACTCCTCTCCTTCCTCCTCCCTCTCCCCGATTCTGCGGCAGATACACTAGGCTCCTCA GAGTTCCCGAAACACACTGGCACACCCTTCCTCAGAGTCCCAGGGCTCCCTGGATTTCTG CCTTAAACTATTTTCCCAGAAATCTGTGTGGCTTGATTCTTTACTTCTTTCAGCTCCCCG CTGAGACGTCACCTGGCCATTATTTAAAATAGTGGTGTTTATTTCTAGCTACATAATATG CTCAAAGGCAGTAAGTGGAACTGGGATTCCAAACCCTGATCTTCATCTTCTTAGCATTCC ATGTTTCCCTGTGGAACTCTTCTTTAAAGCTTATTTAAGAATTCTAGCCGAGCAGGGTGG CTCATGTCTGTAATCCCAGCACTTTGGGAGGCTGAGGTGAGAGGATTGCTTGAGCCCAAG AGTTCGAGACCAACCTGGGCAACATAGTGAGACTTGATCTCTACAAAAAAATATTTAAAA CATTTCCAGGCATGGTGGCATGTGCCTGTAGTCCCAGCTATTCGAGAGGCTGAGATAGGA GAATCACTTGAGCCTGAAAGGTTGAGGCTACAGTGAGCCATGATTGCACCACTGCACTTC AGCCTGGGTGACAGAATGAGATCCTGTCTCAAAAAAAGAAAAAAATTATAAATACATAGT AGTATATAGCTGGGTGTGGTgatgcaagcctgtaattccagttaCTCAGGAGACTGAGGC AAGAGAATTGCTTGAACCCGGGAGTGGAGGTTGCAGTGAGCTGAAATCGTGCCACTGCTC TCCCCAACCTGGGCGACAGAGTGAGACTGTGTCTCGGAAAAAAAAAAAGAAAAAAAAAAG TATATACGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATAGAGAGAGAGAGAGA GAGAGGTATATACATGAATCCAGTGGTTTTTGAGTTTTGTTTTCATTCTTGTTGAGAAAC TGTTCAAATGCTCTCTTAATTCAAAGTGTAAATACATAAGGCAGAAAAAGGCAGAGTTAT GACTGAGGCTGGGTTGGGGGGCCTAAGCCCTGTCCCTTTGGTTTTCTTTTTCTTTCTTTT TTTTTGAGATGGAGTCTCGCTCTGTTGCTCAGGCTGGAGTGCAGTGGTGTGATCTTGGCT CACTGCAACCTCCGCCTCCTGGGTTCAAGCAATTCTCCTACTTCAGCCTCCAAAGTAGCT GGGATTACAGGTATGTGCCACCATGCCCGGCTAATTTTGTATTTTTAGTAGAGATGGGGT TTCTACATGTTGATCAGGCTGGTCTCAAACTCCTGACCTCAGGTGATCCGCCCTCCTCAG CCTCCCAAAAGTGCTGGGATTACAGGTGTGAGCCACTGCACCTGGCCTACAGTTTTTATT TTTTTATAGAGACAGGGTCTTGCTATGTTGCCCAGGCTGGTCTCAAACTCCTAAGCTCAA ACAATCCTCCTGTCTTCTGTGTCCCAAAGTGCTGGAATTACTGCACCTGGCATTTGCAAA CTTTTTAATCAGGCTGTGGTTGGCAGTTTGCCAAGACGATTCCTTGTAGATCTGATTTTG GCAGCAAACAACATAGAAGTCGTACAGGAAATGCTAACAATTACATGTGGTGATTTTGAG AACAGCTACCAAATTCTTCACTTTTGTATCTCAAGCGAATGTTCAAATATTTTTAAAAAT TATTTTTAAGGTATTGACTTTGCCACTCGTAAAATAGCCAAGTTGCTGAAGCCACAGAAA GTGATTGAGCAGAATGGGGATTCTTTTACCATCCACACGAACAGCAGCCTAAGGAACTAC 
TTTGTGAAATTTAAAGTTGGAGAAGAATTTGATGAAGATAACAGAGGCCTGGACAACAGA AAATGCAAGGTAAAAgatgcaagcctgtaattccagttaGATTACGCTTGTAATCCTAAC ACTTTGGGAGGCCAACGCAGGCGGACCACCTGAGGTCAGTAGTTTGAGACCAGCCTGGGC AACACGGCAAAACCCTGTCTCTACAGAAAAAAATTCAAAAAGTAGGGGGGCGTGCTGGCA GGAGCCTGTAATCCCAGCTACTTAGGAGGCTGAGGCAGGAGAATCACTTGAACCCGGGAG GTGGAGGTTGGTTGCAGTAAGCCAAGATCGTGCCACTGCACTCCAGCCTGGGTGACAGAG TGGGACTCCATCTCAAAAAAAAAAAAAAGCAGTAAGTAGGCTGTTGATTTTGCAAGGGTA ACTTGGCATTCTACTTCGTAACACTTGAGGATCCTGCCAGGACAAGCTAACATTTTCTCC TCTCTTCATGCAGAGTTTGGTTATCTGGGACAATGACAGGCTCACCTGTATCCAGAAGGG AGAAAAGAAGAACAGAGGCTGGACCCATTGGATCGAAGGAGACAAACTCCACCTGGTATC CACCACATTTTGTTCTTAATGAGATGATACAGTATTAAAGGAAACATCAGGCCAAGCGTG GTGGCTCACACCTGTAATCCCAGCATTTTAGGAGGCCGAGGTGGGTGTATCACTTGAGGT CAGGAGACTAGCCTGGCCAACATGGTGAAACCCCATCTCTACTATTTTTTTTTTTTTTTT GAGATGGAGTATCGCTGTGTCACCAGGCTGGAGTGCAGTGGCGCGATCTCGGCTCACTGC AACCTCCACCTCCTGGGTCCAAGCGATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGGACT ACAGGCACGCACCACCACACCCAGCTAATTTTTGTATTTTTAGTGGAGACGGGGTTTCAC CATGTTGGCCAGGATGGTCTCGATCTCTTGACCTCATGATCCACCCGCCTCGGCCTCCCA AAGTGCTGGGATTACAGGCATGAGCCACCACCACACCTGGCCCATCTCTACTGAAAATAC AAAAATTAGCCGGGCATAGTGGCGCATGCCTATACTCACTCTCATCTTATATTAAATGAA ACAGCCGAATATTCCGACAGAGGCAGGAAGATAACAgatgcaagGctgtaaGtccagtta TACCTATTATATGTAATAGCTACTTTATGTATACATAGATATGCATAGATAGATATAGTA GCTCACATCTTTGGAGTGATTATTTTGGGCCCAATTACTGTGCTCAATCCTTTGAGTGCA TTATCTCATCTAATCTTCACAACCCTGTGAAAAGGACGCCATTTTTCCCATTCACAAATA AATTGGGATTTTGAAATTCCCCAAGGCTGCTGTCAGAAGCATCAGAATCCAGTTTAAAAG GGTTTATTCAGACTGGGCGAGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCTGA CGTGGGCGGATCACGAGGTCAGGAGATCAAGATCATCCTGGCCAACATGGTGAAACCCCG TCTCTACTAAAAATACAAAAATTAGCTGGGCATGGTGGCACATGTCTGTAATCCCAGCTA CTTGGGAGGCTGAGGCAGGAGAATTGCTAGAACCAGTTAGTCGGAGGTTGCAGTGAGCCA AGATCGCACCACTGCCCTACGGCCTGGTGACAGAGGCCGTTTCAAAAAAAAATAAAACAA AATAAGGGTTTATTCAGGCATGAAGATGAGAATGGCCACCCAGGAAACACAGACTCCAAA GAAATGGGGTCAGTACACCCAAGCTGAAAAGTTAATGTCTTATTTTTTTTTTTTTTTTGA GATGGAGTCTCTCCCTCTGTCACCCAGTGTCACCCACGCTGGAGTGCAGTGGTGTGATCT 
CAGCTCACTGCAACCTCTGCCTCCTGGGTTCAAGCGATCCTCCCACCTCAGCCTCCCGAG TAGCTGGGACTACAGGCATGCACCACCACACCCAGCTAGTTTTTGTATTTTTAGCAGAAA CGGGATTTTACCATATTGGCCAGGCTGGTCTCGAACaatgcaagcctgtaattccagttg CCTGGGCCTCCCAAAGTGTTGGGATTACAGGCGTGGCCGCTTGTAATAAAAATTTAATTT CTTGGAATGTAATTCTTGGAGTTTTTCTTTTCCTTTCTTTTTCTTTTTCTTCTTTTACTT TTAAGCTGTTGGACTTGAGGTGTTTTTCTTTAATGGCTTTATTGAGGTATACTTTATGTA CCAAAAAATGCACCTGTTTTAAGTGTACAGTTTGATAATTT ```