COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Indexing misses duplicates containing non-ATGC characters #935

Open evanlawrence16 opened 5 months ago

evanlawrence16 commented 5 months ago

Describe the bug

When I index the transcriptome, duplicate sequences that contain non-ATGC characters are not identified as duplicates, which leads to issues during quantification.

To Reproduce

Using salmon v1.10.0:

salmon index -p 12 -t testtranscriptome.fa -i nodecoy_salmon_index --keepDuplicates

Only GeneB appears in the resulting duplicate_clusters.tsv.

This is the transcriptome. Each gene is present as two identical copies (_L and _R); the GeneA copies contain non-ATGC characters:

GeneB_L ATAACTACCTTCACACCGGCAACCATTTTGTTTACGAAGCTACAGTACTTGACGGTCAGCAGCAGCTCAT TTTCAACATGGCGTCGAGATATGCTGCGAGAGTGCTGTGGAATTTTACAGCTCTTAGAAGTGCGAAAACA CGACATTTACTCAAACGAATATCTCCTCTGAGCAATTTTAGGGATTTTCCTTATTCAAGTGACTTTTTTC GTAGCTCAGCGCGTTTTGTTTGTGACAATTCAGCTGCGAAAAGTGCTCAACTCGGAAAACTAGACGTTGA GAAGTTCCACTTGATATACACGTGCAGGGTTTGCAATACAAGGTCGAGGAAAACAATCTCAAAGCAGGCA TACCATCATGGTGTTGTTATCGTTAAGTGTCCAGGATGTAGTAACAATCACCTCATTGCTGATAATCTTG GATGGTTTTATAATGACAAGAGAAATATCGAAGACATTCTTGAGGAAAAGGGCGAAAAAGTCACCAAGAA TGTAACAGAAGAGTTAACTTTAGAAGTTTTGGCAGACAAAATTAAGGAATGATGGATCCTTTGTAAAGAT TATGGGTAAATTTTGGAGCTACATTTTGTGTACACTAAATCAATTATACTAAATATTTCAAAAAACTGTC ATCAAGGTGACAATGGTGTCTGTTTATTGCAATATTGGTTGTGCCATGGCATACCAAAAGTTCCGAGACA AGAATGTTGCAGATGCGCAGGAAAATTATGGTTTAATTTTGAGCAAAATGCAAGTGAACTTTGGAACAAT AACAAATAATCAATGTGCTTACTATAAACTGTGAAATGTGTGCACTTAAAGTTATAAAGGTTGGAAGTGA CATATTACTCTCCCTGTAAAGACTATGTATTTGTCAGTAAGTGACATTAATGAATCATCATGGTAAGTCA TTCCTCTACAAATAATATTGGAAGATTTGATATTTTGTACTGTTTAATCTTCATGTTATGAAGTTGACAA TCAAAATTAGTTTTCATAATTAGACAAGTTTTTAAATGTTGCTTTCAAAATCCCCATGTTTTTTCTGTTT TGCTTGGAAGCCTGTGAAGCAGAGAAACGTCTTCAATTCATGATGTTGTGTGCAATCTAATATCCCTCAA GTGATTGTAGCAACCCTGGAAAAAGACATGAATTGAATAAATTAGGTAATACCTCATTTAACAGAACATA AAGTGAG

GeneB_R ATAACTACCTTCACACCGGCAACCATTTTGTTTACGAAGCTACAGTACTTGACGGTCAGCAGCAGCTCAT TTTCAACATGGCGTCGAGATATGCTGCGAGAGTGCTGTGGAATTTTACAGCTCTTAGAAGTGCGAAAACA CGACATTTACTCAAACGAATATCTCCTCTGAGCAATTTTAGGGATTTTCCTTATTCAAGTGACTTTTTTC GTAGCTCAGCGCGTTTTGTTTGTGACAATTCAGCTGCGAAAAGTGCTCAACTCGGAAAACTAGACGTTGA GAAGTTCCACTTGATATACACGTGCAGGGTTTGCAATACAAGGTCGAGGAAAACAATCTCAAAGCAGGCA TACCATCATGGTGTTGTTATCGTTAAGTGTCCAGGATGTAGTAACAATCACCTCATTGCTGATAATCTTG GATGGTTTTATAATGACAAGAGAAATATCGAAGACATTCTTGAGGAAAAGGGCGAAAAAGTCACCAAGAA TGTAACAGAAGAGTTAACTTTAGAAGTTTTGGCAGACAAAATTAAGGAATGATGGATCCTTTGTAAAGAT TATGGGTAAATTTTGGAGCTACATTTTGTGTACACTAAATCAATTATACTAAATATTTCAAAAAACTGTC ATCAAGGTGACAATGGTGTCTGTTTATTGCAATATTGGTTGTGCCATGGCATACCAAAAGTTCCGAGACA AGAATGTTGCAGATGCGCAGGAAAATTATGGTTTAATTTTGAGCAAAATGCAAGTGAACTTTGGAACAAT AACAAATAATCAATGTGCTTACTATAAACTGTGAAATGTGTGCACTTAAAGTTATAAAGGTTGGAAGTGA CATATTACTCTCCCTGTAAAGACTATGTATTTGTCAGTAAGTGACATTAATGAATCATCATGGTAAGTCA TTCCTCTACAAATAATATTGGAAGATTTGATATTTTGTACTGTTTAATCTTCATGTTATGAAGTTGACAA TCAAAATTAGTTTTCATAATTAGACAAGTTTTTAAATGTTGCTTTCAAAATCCCCATGTTTTTTCTGTTT TGCTTGGAAGCCTGTGAAGCAGAGAAACGTCTTCAATTCATGATGTTGTGTGCAATCTAATATCCCTCAA GTGATTGTAGCAACCCTGGAAAAAGACATGAATTGAATAAATTAGGTAATACCTCATTTAACAGAACATA AAGTGAG

GeneA_L CGTCTTGTGACATTTTTGCGATTTTTTGATGAAAATATTCAACGATGGAGCGTGGTTTTGAGCAAGAAAA CTTGTACACAATCTCTAAACATGCAGCAGAATTCAAAACTAAGGTGAAAGTTCTTATTGATAATGAAGAA GAGAAGATAGCACTTTTTAACGCTTTGAAATCCTATCACGAGATCCTATTCCTACTTATGATCTTACATA AGAATGCTGGAGATGAGATCCTCTCTGTCAATAATTTGATTTTAAATGAAGCGACACATGAGGAGGTTGT CAATCTACTAAGATCAAGACGAGTATTGGTTTTAAAAGTTAAAAGTACAGGGAAAGTCCCTTGTAAAATA CTTGATTGTATCAGATGGGAAGAAGTACAGGACAAAGAGAATGTCTACTATCACCCTGACCTACTGTTTC AAAGTCCATTAGAGGTGAGACTGTTACTGCGCATGTCATCCATTGACAGTGTCCCTTTAAGGCTGTCTCA GAAACTMTCCGTYTTAGTMCARGACATCCGATCAATCCTMAAYACACCAAAGAGATACCCCCTATATAGA GATGTCAGGTATTTGATAAATCCAGGTGATAGTGAAGCATTTCTCAAACTCATTCCACAGTCTCCTAGTG ATGGTATTCATGTTGTAAGGATTCATAGAACAGGAAAAGAGGAGGCTGGGTTCAGTATAAGGGGAGGAAG AGAGCACAAAGTAGGAGTTTTTGTGTCTTTTGTGCAGAGAGGGTCACCCGCAGATATTGTTGGACTCAAG GCTGGAGATGAGATCCTCTCTGTGAATAATTTGATTTTAAATGAAGCGACACATGAGGAGGTTGTCAATC TACTAAGATCAAGACGAGTATTGGTTTTAAAAGTTAAAAGTACAGGAAAAGTCCCTTGTAAAATACTTGA TTGTATCAGATGGGAAGAAGTACAGGACAAAGAGAATGTCTACTATCACCCTGACCTACTGTGTCCAGTA AAGGCAGTCTATTCTTCTCCTTATTCAATTATTCATTTGTTTATTTATTCATTTGCTTGTTTATTCTTAG AAGAAGAGAAAGACCCGCAAAAGAGCGGTTTAGATGTCGGCAATGAGGAATTTAGTTCAATGACCCCAAA CACCAGAAAACGATTGTTATTCAAAGAAGGCTTCCAAGTGATCGAATCATCAAATGTAATGGCAGCGAAT CCCACCTCACCAACGTCTCCATCTTCATTGAATCCAAGCAACTTTGAAGTGCTAATTTGCATATACAACT ACATATGCAAACACTTAGAGTTCTAGAAAATTTTCTTTTTAATTAAATGCTCTGAAAATCAACGAAATAA ATAAGAAAGAGTTACTTAGAAAGCCACAATTTAAATTTTTGAAATTACTATTTTACATTTCACACAAGCT CGGTAGAAAGTTTGGCTATTTCGAGTGCTATTTTTAGCCATCAAACATTTGTCACGCAAAGGCCCAGTAT GTGACATTGTTTAGCGATTTTCTTGCGAAAAATGAAATATTTTCAAAAACCAATTACACAGCGATTGTTA CCTAATCATACCTTATCAATATACAAAATATGAATAGATTTGATTTTTCGAGCCTTGCCAAGATATGTCT GATTTTTGCGCGAAGTGGCTCTTAATGCATTTTCTCCAGATACAATACATGGCTGTGAGCTCAATAAACC GACCAATAGCAGTAACATAACCCAAAATGGACCGGTCCGTATAACAGTATAAATTAACGTCACGTGATCG GGTCTACTATTCAAAAATAGATTAAGTGATAGGTAGATTGCATGCCGATATATTTAAAAAGGTCTGAATA TAGATCGAAGAGTATTTTAAGTTAAAATAATAGAATATAATAGGGGTAGAGTGGGTAGGGTATTTTGTAA ATTGTAACCGCGGAGGAAGGGGTAAGTAAGTTGACTAGATGCATGTTAGACACAATCTGTATTTATTTCT 
CGATAACTAGAAAGCTGCAGGACGACTGCAGCACAGAATAGAATATTTATTGAATATAAGGGACATGGTC CACCAGCATCCTTTTCGAGCTTTTATTCATATGTTTGGAAATAAATATACATCGTAATA

GeneA_R CGTCTTGTGACATTTTTGCGATTTTTTGATGAAAATATTCAACGATGGAGCGTGGTTTTGAGCAAGAAAA CTTGTACACAATCTCTAAACATGCAGCAGAATTCAAAACTAAGGTGAAAGTTCTTATTGATAATGAAGAA GAGAAGATAGCACTTTTTAACGCTTTGAAATCCTATCACGAGATCCTATTCCTACTTATGATCTTACATA AGAATGCTGGAGATGAGATCCTCTCTGTCAATAATTTGATTTTAAATGAAGCGACACATGAGGAGGTTGT CAATCTACTAAGATCAAGACGAGTATTGGTTTTAAAAGTTAAAAGTACAGGGAAAGTCCCTTGTAAAATA CTTGATTGTATCAGATGGGAAGAAGTACAGGACAAAGAGAATGTCTACTATCACCCTGACCTACTGTTTC AAAGTCCATTAGAGGTGAGACTGTTACTGCGCATGTCATCCATTGACAGTGTCCCTTTAAGGCTGTCTCA GAAACTMTCCGTYTTAGTMCARGACATCCGATCAATCCTMAAYACACCAAAGAGATACCCCCTATATAGA GATGTCAGGTATTTGATAAATCCAGGTGATAGTGAAGCATTTCTCAAACTCATTCCACAGTCTCCTAGTG ATGGTATTCATGTTGTAAGGATTCATAGAACAGGAAAAGAGGAGGCTGGGTTCAGTATAAGGGGAGGAAG AGAGCACAAAGTAGGAGTTTTTGTGTCTTTTGTGCAGAGAGGGTCACCCGCAGATATTGTTGGACTCAAG GCTGGAGATGAGATCCTCTCTGTGAATAATTTGATTTTAAATGAAGCGACACATGAGGAGGTTGTCAATC TACTAAGATCAAGACGAGTATTGGTTTTAAAAGTTAAAAGTACAGGAAAAGTCCCTTGTAAAATACTTGA TTGTATCAGATGGGAAGAAGTACAGGACAAAGAGAATGTCTACTATCACCCTGACCTACTGTGTCCAGTA AAGGCAGTCTATTCTTCTCCTTATTCAATTATTCATTTGTTTATTTATTCATTTGCTTGTTTATTCTTAG AAGAAGAGAAAGACCCGCAAAAGAGCGGTTTAGATGTCGGCAATGAGGAATTTAGTTCAATGACCCCAAA CACCAGAAAACGATTGTTATTCAAAGAAGGCTTCCAAGTGATCGAATCATCAAATGTAATGGCAGCGAAT CCCACCTCACCAACGTCTCCATCTTCATTGAATCCAAGCAACTTTGAAGTGCTAATTTGCATATACAACT ACATATGCAAACACTTAGAGTTCTAGAAAATTTTCTTTTTAATTAAATGCTCTGAAAATCAACGAAATAA ATAAGAAAGAGTTACTTAGAAAGCCACAATTTAAATTTTTGAAATTACTATTTTACATTTCACACAAGCT CGGTAGAAAGTTTGGCTATTTCGAGTGCTATTTTTAGCCATCAAACATTTGTCACGCAAAGGCCCAGTAT GTGACATTGTTTAGCGATTTTCTTGCGAAAAATGAAATATTTTCAAAAACCAATTACACAGCGATTGTTA CCTAATCATACCTTATCAATATACAAAATATGAATAGATTTGATTTTTCGAGCCTTGCCAAGATATGTCT GATTTTTGCGCGAAGTGGCTCTTAATGCATTTTCTCCAGATACAATACATGGCTGTGAGCTCAATAAACC GACCAATAGCAGTAACATAACCCAAAATGGACCGGTCCGTATAACAGTATAAATTAACGTCACGTGATCG GGTCTACTATTCAAAAATAGATTAAGTGATAGGTAGATTGCATGCCGATATATTTAAAAAGGTCTGAATA TAGATCGAAGAGTATTTTAAGTTAAAATAATAGAATATAATAGGGGTAGAGTGGGTAGGGTATTTTGTAA ATTGTAACCGCGGAGGAAGGGGTAAGTAAGTTGACTAGATGCATGTTAGACACAATCTGTATTTATTTCT 
CGATAACTAGAAAGCTGCAGGACGACTGCAGCACAGAATAGAATATTTATTGAATATAAGGGACATGGTC CACCAGCATCCTTTTCGAGCTTTTATTCATATGTTTGGAAATAAATATACATCGTAATA
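
To cross-check which records are exact duplicates independently of salmon, a minimal Python sketch along these lines may help. It is not salmon's duplicate-detection logic; it only groups records whose sequences are identical character for character (case-insensitive) and flags groups that contain characters other than A, C, G, T. It assumes the transcriptome is a standard FASTA file with `>` header lines, and the names `read_fasta` and `duplicate_clusters` are hypothetical helpers made up for this illustration.

```python
import re
import sys
from collections import defaultdict

# Anything other than A, C, G, T (e.g. IUPAC codes such as M, Y, R, N).
NON_ACGT = re.compile(r"[^ACGT]")


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines.

    Sequences are upper-cased and multi-line records are joined.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the name.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def duplicate_clusters(records):
    """Group record names by identical sequence.

    Returns a list of (names, has_non_acgt) tuples, one per sequence
    that occurs more than once, in order of first appearance.
    """
    by_seq = defaultdict(list)
    for name, seq in records:
        by_seq[seq].append(name)
    return [
        (names, bool(NON_ACGT.search(seq)))
        for seq, names in by_seq.items()
        if len(names) > 1
    ]


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage (hypothetical script name): python find_dups.py testtranscriptome.fa
    with open(sys.argv[1]) as handle:
        for names, ambiguous in duplicate_clusters(read_fasta(handle)):
            note = "contains non-ACGT characters" if ambiguous else "ACGT only"
            print("\t".join(names), note, sep="\t")
```

Run on the transcriptome above (with `>` headers restored), this should list both the GeneB and the GeneA pair if each pair really is identical, which can then be compared against the contents of duplicate_clusters.tsv.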