DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0

-I and -X : confusion or bug? #52

Closed jdidion closed 8 years ago

jdidion commented 8 years ago

I am evaluating HISAT2 and other aligners using simulated reads. I see numerous cases where HISAT2 picks an incorrect alignment whose fragment size is much larger than the maximum specified by -X. I am relying on the default value of -X (500) rather than setting -X explicitly.

An example:

chr1-143174182--:chr1-143173984-I,chr1-143173983-:::::::::::82309NjU0:0 83 chr1 143174182 255 150M = 142677513 -496819 AAGTTAACCACTTCCAAGTTACAATGATTCTACAGGCATTGGGTAAACATTCCAAGCCAAATAGAAGAAATTTCCCAGAAAGAAGCTCAAAACACAGATGGGACTTACACACCCCCTGCAAGTCAAAAACCCAGCAGGCCAGGCATTCCA GGCGGGCGGCCGGGCGGGGGCGGGG8GGGCGCCGCGG=GCGGGGGGCCCJGGGGCCCGCGGGCG=CGGG=GGGGGCGGG=J8CGGGJJGGGGJJJGGGJGGCJJGGGGGJJGJJJJ=JJGJJJGJJJGGJJJJGGJJGGGGGGGGGGCC1 AS:i:-18 ZS:i:-36 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:29G16G39A6 YS:i:-24 YT:Z:CP NH:i:1

chr1-143174182--:chr1-143173984-I,chr1-143173983-:::::::::::82309NjU0:0 163 chr1 142677513 255 150M = 143174182 496819 ACACAGAGCCAAAACATATTATTGTGTCCCTGGTCCCCCAAAATTCAGGTCTTTCTCACATTGCAAAATGCAATAAGGCCTTCCCTAGAGTCCCCCAAATCTTAACTCATCCCAGCATTTACTCAAATGTCCAAAGGCCAAAGTCTCCTC C=CGGGGGGGGGGGJJJJJGJJCGJJJGJJGJJJGJGJCJCJJ=JGJGJGJG8JGJGGGGCG11GGGGCGGCG8CCGCGCGCGGCCGGGGGGCGGGCGGCCGJJJJJCCGGGGG=CGGGCGGCGG=GGGG8GGGCGG1(GGGCGCGGG=C AS:i:-24 ZS:i:-26 XN:i:0 XM:i:4 XO:i:0 XG:i:0 NM:i:4 MD:Z:23C18G4T26G75 YS:i:-18 YT:Z:CP NH:i:1

For some reason, this hit gets a high alignment score even though the fragment size (496819) is much bigger than the limit. The correct alignment (found by BWA):

chr1-143174182--:chr1-143173984-I,chr1-143173983-:::::::::::82309NjU0:0 83 chr1 143174182 60 150M = 143173983 -349 AAGTTAACCACTTCCAAGTTACAATGATTCTACAGGCATTGGGTAAACATTCCAAGCCAAATAGAAGAAATTTCCCAGAAAGAAGCTCAAAACACAGATGGGACTTACACACCCCCTGCAAGTCAAAAACCCAGCAGGCCAGGCATTCCA GGCGGGCGGCCGGGCGGGGGCGGGG8GGGCGCCGCGG=GCGGGGGGCCCJGGGGCCCGCGGGCG=CGGG=GGGGGCGGG=J8CGGGJJGGGGJJJGGGJGGCJJGGGGGJJGJJJJ=JJGJJJGJJJGGJJJJGGJJGGGGGGGGGGCC1 NM:i:3 MD:Z:29G16G39A63 AS:i:135 XS:i:120 XA:Z:chr1,+143413960,150M,6;chr4,+49574530,150M,6;chr1,-142677712,150M,6;

chr1-143174182--:chr1-143173984-I,chr1-143173983-:::::::::::82309NjU0:0 163 chr1 143173983 40 150M = 143174182 349 ACACAGAGCCAAAACATATTATTGTGTCCCTGGTCCCCCAAAATTCAGGTCTTTCTCACATTGCAAAATGCAATAAGGCCTTCCCTAGAGTCCCCCAAATCTTAACTCATCCCAGCATTTACTCAAATGTCCAAAGGCCAAAGTCTCCTC C=CGGGGGGGGGGGJJJJJGJJCGJJJGJJGJJJGJGJCJCJJ=JGJGJGJG8JGJGGGGCG11GGGGCGGCG8CCGCGCGCGGCCGGGGGGCGGGCGGCCGJJJJJCCGGGGG=CGGGCGGCGG=GGGG8GGGCGG1(GGGCGCGGG=C NM:i:5 MD:Z:1G21C18G4T26G75 AS:i:128 XS:i:130 XA:Z:chr1,+142677513,150M,4;chr4,-49574729,150M,5;chr13,+19370710,150M,8;

Is this a bug? Or do I need to explicitly set -X?
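
A check like the one described above can be sketched in a few lines of Python. This is a minimal sketch, not part of HISAT2 or BWA: the function name `oversized_pairs` and the shortened example records are hypothetical, and it assumes standard SAM column order (FLAG in column 2, TLEN in column 9) and the default -X of 500.

```python
def oversized_pairs(sam_lines, max_insert=500):
    """Yield (name, chrom, pos, tlen) for records that are flagged as
    properly paired but whose |TLEN| exceeds max_insert.

    max_insert mirrors the -X limit (500 by default, as stated above).
    """
    for line in sam_lines:
        # Skip blank lines and SAM header lines.
        if not line.strip() or line.startswith("@"):
            continue
        # SAM is tab-delimited; split() also copes with space-pasted records.
        fields = line.split()
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag, chrom, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
        tlen = int(fields[8])
        properly_paired = bool(flag & 0x2)  # SAM flag bit 0x2
        if properly_paired and abs(tlen) > max_insert:
            yield name, chrom, pos, tlen


# Hypothetical, shortened records: SEQ/QUAL replaced by "*", read name abbreviated.
example_records = [
    "read1\t83\tchr1\t143174182\t255\t150M\t=\t142677513\t-496819\t*\t*",
    "read1\t163\tchr1\t142677513\t255\t150M\t=\t143174182\t496819\t*\t*",
    "read2\t83\tchr1\t143174182\t60\t150M\t=\t143173983\t-349\t*\t*",
]

for hit in oversized_pairs(example_records):
    print(hit)
```

With the two HISAT2-style records above, both mates are reported because flags 83 and 163 include the properly-paired bit while |TLEN| is 496819; the BWA-style record with TLEN -349 passes the check.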

infphilo commented 8 years ago

Unfortunately, -I and -X are not yet implemented in HISAT2. I'll implement them soon, possibly in the next release of HISAT2.

infphilo commented 8 years ago

The -I and -X options now work in HISAT2 (on the master branch). These options are valid only with --no-spliced-alignment, which is used for aligning DNA-seq reads.
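
For illustration, an invocation combining these options might be assembled as below. This is only a sketch: the helper `hisat2_dna_command`, the index name, and the file names are hypothetical placeholders, and `-I 0` is assumed here as the minimum fragment length. The flags themselves (`-x`, `-1`, `-2`, `-S`, `-I`, `-X`, `--no-spliced-alignment`) are standard HISAT2 command-line options.

```python
import shlex


def hisat2_dna_command(index, reads1, reads2, out_sam, min_insert=0, max_insert=500):
    """Build an argv list for a paired-end DNA-seq run of hisat2.

    --no-spliced-alignment is included because, per the comment above,
    -I and -X are only valid together with it.
    """
    return [
        "hisat2",
        "--no-spliced-alignment",
        "-I", str(min_insert),   # minimum fragment length
        "-X", str(max_insert),   # maximum fragment length
        "-x", index,             # hypothetical index basename
        "-1", reads1,            # hypothetical mate 1 file
        "-2", reads2,            # hypothetical mate 2 file
        "-S", out_sam,           # output SAM file
    ]


cmd = hisat2_dna_command("genome_index", "reads_1.fq", "reads_2.fq", "out.sam")
print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```

Building the command as a list rather than a single string avoids shell-quoting problems when file names contain spaces.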

jdidion commented 8 years ago

Thank you!