DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0
464 stars 112 forks source link

hisat2 is not robust in term of short reads mapping #360

Closed y9c closed 2 years ago

y9c commented 2 years ago

I would like to know if hisat2 is a suitable aligner for short reads (<30nt). The testing is as bellow.

I copied two unique sequences (demo1 & demo2) from the human genome, and download the genome_snp_rep index of hg38 as the mapping reference. Then I add a 5' overhang or 3' overhang respectively. For example, sequence demo1_5 is exactly the sequence of demo1 but there is an unmapped G base on the 5' end. The overhang sequence is very common is some RNA libraries. Because the reverse transcriptase enzyme can add random tail (the length is not fix) during the RT step, and even trimming 3bp from the sequence is recommended, some reads can still have one or two overhang bases after timming.

By running hisat2 with default setings (hisat2 -x genome_snp_rep -U test.fq), it seem that 5' overhang will affect the mapping of demo1 read, while 3' overhang is not. (output1) Is this means the alignment is from the 5'->3' direction? But it is weird that demo2 read show an opposite result.

Then I adjust the mapping parameters into hisat2 --bowtie2-dp 2 --score-min L,0,-1 --sp 1,0 --mp 1,0 --rdg 0,2 -x genome_snp_rep -U test.fq. The result (output2) do not show increase on mapping ratio. Meanwhile, some MD tag look very strange.

Could help me figure out what cause this problem in short read alignment? Thanks!


input:

@demo1
AAATCATGATACATACTGTACAGT
+
FFFFFFFFFFFFFFFFFFFFFFFF
@demo1_5
GAAATCATGATACATACTGTACAGT
+
FFFFFFFFFFFFFFFFFFFFFFFFF
@demo1_3
AAATCATGATACATACTGTACAGTG
+
FFFFFFFFFFFFFFFFFFFFFFFFF
@demo1_53
GAAATCATGATACATACTGTACAGTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFF
@demo2
TTCAAATTCCTCCCTGTACGAAAGGACAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@demo2
TTCAAATTCCTCCCTGTACGAAAGGACAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@demo2_5
CTTCAAATTCCTCCCTGTACGAAAGGACAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@demo2_3
TTCAAATTCCTCCCTGTACGAAAGGACAAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@demo2_53
CTTCAAATTCCTCCCTGTACGAAAGGACAAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

output1:

demo1   16      20      36045791        60      24M     *       0       0       ACTGTACAGTATGTATCATGATTT        FFFFFFFFFFFFFFFFFFFFFFFF        AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:24 YT:Z:UU NH:i:1
demo1_5 4       *       0       0       *       *       0       0       GAAATCATGATACATACTGTACAGT       FFFFFFFFFFFFFFFFFFFFFFFFF       YT:Z:UU
demo1_3 16      20      36045791        60      1S24M   *       0       0       CACTGTACAGTATGTATCATGATTT       FFFFFFFFFFFFFFFFFFFFFFFFF       AS:i:-1 XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:24 YT:Z:UU NH:i:1
demo1_53        4       *       0       0       *       *       0       0       GAAATCATGATACATACTGTACAGTG      FFFFFFFFFFFFFFFFFFFFFFFFFF      YT:Z:UU
demo2   0       MT      3108    60      29M     *       0       0       TTCAAATTCCTCCCTGTACGAAAGGACAA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFF   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:29 YT:Z:UU NH:i:1
demo2   0       MT      3108    60      29M     *       0       0       TTCAAATTCCTCCCTGTACGAAAGGACAA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFF   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:29 YT:Z:UU NH:i:1
demo2_5 0       MT      3108    60      1S29M   *       0       0       CTTCAAATTCCTCCCTGTACGAAAGGACAA  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  AS:i:-1 XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:29 YT:Z:UU NH:i:1
demo2_3 4       *       0       0       *       *       0       0       TTCAAATTCCTCCCTGTACGAAAGGACAAA  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  YT:Z:UU
demo2_53        4       *       0       0       *       *       0       0       CTTCAAATTCCTCCCTGTACGAAAGGACAAA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF YT:Z:UU

(output2)

demo1   16      20      36045791        60      24M     *       0       0       ACTGTACAGTATGTATCATGATTT        FFFFFFFFFFFFFFFFFFFFFFFF        AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:24 YT:Z:UU NH:i:1
demo1_5 4       *       0       0       *       *       0       0       GAAATCATGATACATACTGTACAGT       FFFFFFFFFFFFFFFFFFFFFFFFF       YT:Z:UU
demo1_3 16      20      36045790        1       25M     *       0       0       CACTGTACAGTATGTATCATGATTT       FFFFFFFFFFFFFFFFFFFFFFFFF       AS:i:0  ZS:i:0  XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:0G24       YT:Z:UU NH:i:2
demo1_3 272     20      36045800        1       25M     *       0       0       CACTGTACAGTATGTATCATGATTT       FFFFFFFFFFFFFFFFFFFFFFFFF       AS:i:0  ZS:i:0  XN:i:0  XM:i:14 XO:i:0  XG:i:0  NM:i:14 MD:Z:0T1T0G0T0A0T2T0G2T1T3C0A0C0A1C0    YT:Z:UU NH:i:2  Zs:Z:3|S|rs2254452
demo1_53        4       *       0       0       *       *       0       0       GAAATCATGATACATACTGTACAGTG      FFFFFFFFFFFFFFFFFFFFFFFFFF      YT:Z:UU
demo2   0       MT      3108    60      29M     *       0       0       TTCAAATTCCTCCCTGTACGAAAGGACAA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFF   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:29 YT:Z:UU NH:i:1
demo2   0       MT      3108    60      29M     *       0       0       TTCAAATTCCTCCCTGTACGAAAGGACAA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFF   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:29 YT:Z:UU NH:i:1
demo2_5 0       MT      3117    60      30M     *       0       0       CTTCAAATTCCTCCCTGTACGAAAGGACAA  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  AS:i:0  ZS:i:-1 XN:i:0  XM:i:20 XO:i:0  XG:i:0  NM:i:20 MD:Z:2C1C0T0G1A1G0A0A0A0G0G0A0C1A2G1A0A0T0A1G0  YT:Z:UU NH:i:1
demo2_3 4       *       0       0       *       *       0       0       TTCAAATTCCTCCCTGTACGAAAGGACAAA  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  YT:Z:UU
demo2_53        4       *       0       0       *       *       0       0       CTTCAAATTCCTCCCTGTACGAAAGGACAAA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF YT:Z:UU

Some other issue might be relative to this:

329 #84

imzhangyun commented 2 years ago

@y9c Very good question!

This is highly related to how HISAT2 is implemented.

For your alignment result, you can find that:

If the additional base is always at the end of your read sequence, I suggest you try -5/--trim5 and -3/--trim3 option.

Best, Leo

y9c commented 2 years ago

Thank you for the explanation. I realized that this might cause another problem even when the reads are long.

After trimming, the start position of read2 might be ahead of end position of read1 as the diagram bellow. Top strand (red) is read 1, while bottom strand (green) is read2. This read can NOT be mapped in PE mode.

image

(a demo sequence from the ASM38351v1 genome)

Read1:

@demo1
CAACTAGGAAGTTGGCTTAGAAGCAGCCACCTTTTAAAGAGTGCGTAATTGCTCACTAG
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

Read2:

@demo1
TAGTGAGCAATTACGCACTCTTTAAAAGGTGGCTGCTTCTAAGCCAACTTCCTAGTTG
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE