DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0
464 stars 112 forks source link

[Feature request] Directional mapping mode for hisat2-3n aligner #357

Closed y9c closed 2 years ago

y9c commented 2 years ago

It seem that hisat2-3n detect the strand of a sequence by the number of conversion, and assign a YZ tag for each read accordingly. This works for most of the situation, such as the example bellow (a bisulfite sequencing data). There are more C->T conversions on read 1, thus it is marked as positive strand (+). There are more G->A conversion on read2, meanwhile read 2 is reverse direction, thus it is also marked as positive strand (+).

    (R1-->) 5'-ATTATTTAGAATATATTTAATTTTAAA-3'
               |..|...||||.|||...|||.|.|||
5'-GTCTATGGTCATACCACCCTGAACACACCCAATCTCAAATCAGGCCTGGTTAGTACTTGGATGGGAGGATTTGTCCAGTCC-3'

                                    3'-TTTAATCCAAACCAATCATAAA-5' (<--R2)
                                       ||||.|||..|||||||||.||
3'-CAGATACCAGTATGGTGGGACTTGTGTGGGTTAGAGTTTAGTCCGGACCAATCATGAACCTACCCTCCTAAACAGGTCAGG-5'

But this might cause some problem in this two situations,

  1. The conversion efficiency is low on some of region, which is known as clustering effect. Such as the example bellow, the C sites within the second window can not be converted into T efficiently. Although hisat2-3n can still mapped this read (R2) in a correctly way, because there is no C->T conversion singal (by default it will be +) and read2 is in reverse direction, hisat-3n will report this read as negative strand (-).
    (R1-->) 5'-ATTATTTAGAATATATTTAATTTTAAA-3'
               |..|...||||.|||...|||.|.|||
5'-GTCTATGGTCATACCACCCTGAACACACCCAATCTCAAATCAGGCCTGGTTAGTACTTGGATGGGAGGATTTGTCCAGTCC-3'

                                    3'-TTTAGTCCGGACCAATCATGAA-5' (<--R2)
                                       ||||||||||||||||||||||
3'-CAGATACCAGTATGGTGGGACTTGTGTGGGTTAGAGTTTAGTCCGGACCAATCATGAACCTACCCTCCTAAACAGGTCAGG-5'
  1. There is no C site within a short genomic region, which means there would not be any conversion signal for reads lays within this region. Similarly, by default the strand is +, and read2 is in reverse direction. hisat2-3n will report this report as negative strand (-).
    (R1-->) 5'-ATTATTTAGAATATATTTAATTTTAAA-3'
               |..|...||||.|||...|||.|.|||
5'-GTCTATGGTCATACCACCCTGAACACACCCAATCTCAAATGAGGGGTGGTTAGTAATTGGATGGGAGGATTTGTCCAGTCC-3'

                                    3'-TTTACTCCCCACCAATCATAAA-5' (<--R2)
                                       ||||||||||||||||||||||
3'-CAGATACCAGTATGGTGGGACTTGTGTGGGTTAGAGTTTACTCCCCACCAATCATTAACCTACCCTCCTAAACAGGTCAGG-5'

We know that most of libraries are constructed in a strand-specific way, eg, RNA bisulfite libraries for detecting m5C modification, DNA libraries constructed by Swift kit etc. p7 adapter are ligated into the 5' end of single stranded RNA/DNA after conversion (say, C->T). Thus, the read1 should have the same mutation (C->T) as the conversion.

I would like to know is it possible to make use of this information and let hisat2-3n output the YZ tag based on the reads direction rather than the conversion number?

Similar issue might be: isssue #346

Thanks!

y9c commented 2 years ago

A more specific case, which is not often.

    (R1-->) 5'-ATTATTTAGAATATATTTAATTTTAAA-3'
               |..|...||||.|||...|||.|.|||
5'-GTCTATGGTCATACCACCCTGAACACACCCAATCTCAAATGAGGGGTGGTTAGTAATTGGATGGGAGGATTTGTCCAGTCC-3'

                                    3'-TTTACTCTCCACCAATCATAAA-5' (<--R2)
                                       ||||||| ||||||||||||||
3'-CAGATACCAGTATGGTGGGACTTGTGTGGGTTAGAGTTTACTCCCCACCAATCATTAACCTACCCTCCTAAACAGGTCAGG-5'

There is one C->T mutation on read2, although it will be less mismatch when mark read2 as negative strand (YZ:A:-). But I think it will be more reasonable to mark read2 as positive strand (+) and record one G->A mismatch. Because we have the prior knowledge from R1 and the strandness of library.

imzhangyun commented 2 years ago

Hello @y9c ,

Thank you very much for using HISAT-3N and your suggestion. I think making directional mapping for strand specific reads is a good idea. I made a new option for HISAT-3N. HISAT-3N can make directional mapping with the option --directional-mapping. This mode should be about 2x faster than standard non-directional mapping, but it only works for strand specific library. Please check the hisat-3n_directional branch for this test version.

Best, Leo

imzhangyun commented 2 years ago

@y9c

I just merge the directional mapping branch to hisat-3n branch. Now HISAT-3N supports directional-mapping. Thank you for your request. Please check our website for more detail.

y9c commented 2 years ago

Thanks!

y9c commented 2 years ago

Hi @imzhangyun ,

I would like to know if you can add an argument for --directional-mapping to switch the strand orientation?

image

Take the Takara SMARTer kit as an example, the R1 in reverse order. This might cause some problem in using this setting.

Thanks!

imzhangyun commented 2 years ago

Hello @y9c,

I wonder if you can exchange the read1.fq and read2.fq?

Best, Leo

y9c commented 2 years ago

Thanks. I mean single end sequencing. I only have read 1

ezecalvo commented 4 days ago

Hi, was this issue resolved? I do not see the reverse mapping as an option on the most updated hisat3n version. Also can't find the dev branch where it is supposed to be.

Thanks!