DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0

Hisat2 does not align almost exact reads with snptran index #383

Open bgorissen opened 2 years ago

bgorissen commented 2 years ago

I'm using Hisat2 and found a few near-exact reads that don't align, and I can't tell whether this is due to a bug or a limitation of the algorithm.

The reads and their closest matches are listed at the end of this report for readability. I'm using the prebuilt grch38_snptran.tar.gz index and confirmed with hisat2-inspect that it contains the reference sequences shown below.

When I use the index without SNPs and transcripts (grch38_genome.tar.gz), both reads align uniquely, so the failure presumably stems from the larger number of potential mapping locations. Based on this issue I've set the options --max-seeds and --max-altstried to very generous values, but that didn't change anything. To reduce the number of potential locations, I created an index grch38_chr12_snptran, which is grch38_snptran restricted to chr12. With this index, Hisat2 still does not align the first example (I have not run the equivalent experiment for example 2 with chr13).
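For anyone who wants to repeat the comparison across indexes, here is a minimal sketch of how the invocations could be assembled. It only builds the command line and does not run anything. The index paths, the placeholder read, and the numeric values passed to --max-seeds and --max-altstried are hypothetical, since the exact values are not given above. It assumes the read is passed on the command line with -c and -U.

```python
import shlex

def hisat2_command(index_prefix, read, extra_options=()):
    """Build an argv list that aligns one unpaired read given on the command line."""
    # -x: index prefix, -c: reads are sequences rather than files, -U: unpaired read
    return ["hisat2", "-x", index_prefix, "-c", "-U", read, *extra_options]

# Hypothetical index locations; adjust to wherever the indexes were unpacked.
INDEXES = [
    "grch38/genome",
    "grch38_snptran/genome_snp_tran",
    "grch38_chr12_snptran/genome_snp_tran",
]

# Hypothetical "generous" values; the report does not state the ones actually used.
GENEROUS = ["--max-seeds", "10000", "--max-altstried", "10000"]

if __name__ == "__main__":
    read = "ACGTACGTACGT"  # placeholder, substitute the example read
    for index in INDEXES:
        print(shlex.join(hisat2_command(index, read, GENEROUS)))
```

Printing the commands instead of executing them makes it easy to check that the only thing changing between runs is the index.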

Adding the option --bowtie2-dp 2 makes Hisat2 align the first example to chr12:54284611 with index grch38_snptran, but not with grch38_chr12_snptran (no alignment is found). Adding --score-min L,0,-1 on top of --bowtie2-dp 2 makes the chr12-only index find the alignment as well. Upon closer inspection, setting --score-min L,0,-1 makes align() (hi_aligner.h:5551) find an extra anchor hit, after which --bowtie2-dp 2 allows hybridSearch() (spliced_aligner.h:142) to find the match.
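To make the effect of --score-min L,0,-1 concrete, here is a minimal sketch of how such a function string translates into a minimum alignment score. It assumes the Bowtie 2-style convention of a function type (C, L, S, G) followed by a constant and a coefficient, and it is not taken from the Hisat2 source. The helper name min_score and the read length of 100 are illustrative only.

```python
import math

def min_score(spec, read_len):
    """Evaluate a --score-min style spec such as "L,0,-1" for a given read length."""
    parts = spec.split(",")
    kind = parts[0].upper()
    const = float(parts[1])
    # A constant function may omit the coefficient.
    coef = float(parts[2]) if len(parts) > 2 else 0.0
    if kind == "C":   # constant: f(x) = const
        return const
    if kind == "L":   # linear: f(x) = const + coef * x
        return const + coef * read_len
    if kind == "S":   # square root: f(x) = const + coef * sqrt(x)
        return const + coef * math.sqrt(read_len)
    if kind == "G":   # natural log: f(x) = const + coef * ln(x)
        return const + coef * math.log(read_len)
    raise ValueError(f"unknown function type: {kind!r}")

if __name__ == "__main__":
    # With L,0,-1 the threshold drops by one score point per read base.
    print(min_score("L,0,-1", 100))
```

Under this convention, L,0,-1 lets the allowed penalty grow linearly with read length, which would be consistent with an additional anchor hit passing the filter.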

For Example 2, none of these options work. I hope these small detailed examples help you pinpoint potential issues as I'm looking forward to using the snptran index.

Example 1 read:

GATTTGTGAACTCAGCCAAGCACAGTGGTGGCAGGGCCTAGCTGCTACAAAGAAGACATGTTTTAGACAAATACTCATGTGTATGGGCAAAAAACTCGAGGACTGTATTTGTGACTAATTGTATAACAGGTTATTTTAGTTTCTGTTCTGTGGAAAGTGTAAAGCATTCCAACAAAGGGTTTTAATGTAGATTTTTTTTTTTGAACCCCATGCTGTTGATTGCTAAATGTAACAGTCTGATAGTGACGATGAATAAATGTCTTT

Closest matches within the reference to example 1:

Chr5:136429928 (Levenshtein distance 24 to read): GATTTGTGAACTCAGCCAAGCACAGTGGTGGCAGGGCCTAGCTGCTACAAATAAGACATGCTTTAGACAAATACTCATATGTATGGGCAAAAAACTCAAGAATTGTATTTGTGACTAATTGGATAACCAGTGATTTTAGTTTCTGTTCTGTGGAAAGTATAAAGCATTCCAACAAAGGGTTTTAATGTAGTTTTTTTTGTTTTTGCACCCATGCTATTGATTGCTAAATGTAATAGTCTGACATGATGCTGAATAAATGTGTCT

Chr10:46416484 (distance 17): GATTTGTGAACTCAGCCAAGCACAGTGGTGGCAGGGCCTAGCTGCTACAAAGAAGACATGTTTTAGACAAATACTCATGTGTATGGGCAAAAAATTCGAGGACTGTATTTGTGACTAACTGTATAACAGGTTATTTTAGTTTCTGTTCTGTGGAAAGTGTAAAGCATTCCAGCTAAGGGTTTTAATATAGGTTTTTTTTTTTTTTGCACCCATGCTGTTGATTGCTAAATGTAATAGTCTGATCATGACGCTGAATAAATGTCT

Chr12:54284611 (distance 3): GATTTGTGAACTCAGCCAAGCACAGTGGTGGCAGGGCCTAGCTGCTACAAAGAAGACATGTTTTAGACAAATACTCATGTGTATGGGCAAAAAACTCGAGGACTGTATTTGTGACTAATTGTATAACAGGTTATTTTAGTTTCTGTTCTGTGGAAAGTGTAAAGCATTCCAACAAAGGGTTTTAATGTAGATTTTTTTTTTTGCACCCCATGCTGTTGATTGCTAAATGTAACAGTCTGATCGTGACGCTGAATAAATGTCTTT

Chr13:52643522 (distance 8): GATTTGTGAACTCAGCCAAGCACAGTGGTGGCAGGGCCTAGCTGCTACAAAGAAGACATGTTTTAGACAAATACTCATGTGTATGGGCAAAAAACTCGAGGACTGTATTTGTGACTAATTGTATAACAGGTTATTTTAGTTTCTGTTCTGTGGAAAGTGTAAAGCATTCCAACAAAGTGTTTTAATGTAGATTTTTTTTTTTGCACCCATGCTGTTGATTGCTAAATGTAATAGTCTGATTGTGACGCTGAATAAATGTCTCTA

ChrX:119221811 (distance 19): GATTTGTGAACTCAGCCAAGCACAGTGGTGGCAGGGCCTAGCTGCTACAATGAAGACATGTTTTAGACAAATACTCATGTATATGGGCAAAAAACTCGAGAACTGTATTTGCGACTAATTGTATAACAGGTTATTTTAGTTTCTGTTCTGCGGAAAGTATAAAGCATTCTAACAAAGGGTTTTAAATGTAGATTTTTTTTTGCACCCATGCTGTTGATTGCTAAATATAACAGTCTGATCGTGATGCTGAATAAAGGTCTTTTT
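The distances quoted for these candidates can be rechecked with a short script. The sketch below is a plain dynamic-programming Levenshtein distance (unit cost for substitutions, insertions, and deletions), which is an assumption about how the numbers were computed. The function name levenshtein is illustrative, and the sequences in the usage example are short placeholders, not the reads above.

```python
def levenshtein(a, b):
    """Edit distance between two strings, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion from a
                curr[j - 1] + 1,     # insertion into a
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    read = "GATTTGTGAAC"        # placeholder, paste the full read here
    candidate = "GATTTGTCAAC"   # placeholder, paste a reference window here
    print(levenshtein(read, candidate))
```

Pasting the read and each reference window into the two variables gives one number per candidate, which makes it easy to confirm which locus is closest.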

Example 2 read:

CTGCCTGTGCAGAGGCCTTGGCCTTCCCGACCCACATGGACCCTCCTTGGTTCATGCCCTACACAGCTTTTCCCTTCTCTCTGTGGAGGGGAGAAAGGGTACATGGAGCATGAGGAGGAACTGGGGTGCCTCTTACCCAGACTTAAGTAACCCTCTACTTCTCTCCTCCTTCCACAGGGCCTAGACCCTCTAGTCCAGGGGTATCTAGTCTTTTGAACTCTTATGTGACACATAGGAAGAAGAATTGTCTTGTGCCACACATAA

Closest match to example 2:

Chr13:99337707 (distance 4): CTGCCTGTGCAGAGGCCTTGGCCTTCCCAACCCACATGGACCCTCCTTGGTTCATGCCCTACACAGCTTTTCCCTTCTCTCTGTGGAGGGGAGAAAGGGTACATGGAGCATGAGGAGGAACTGGGGTGCCTCTTACCCAGACTTAAGTAACCCTCTACTTCTCTCCTCCTTCCACAGGGCCTAGACCCTCTAGTCCAGGGGTATCTAGTCTTTTGAACTCTTCTGTGACACATTGGAAGAAGAATTGTCTTGGGCCACACATAA