DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0
473 stars 116 forks source link

hisat-3n directional-mapping and repeat mode on single-cell methylome data #381

Open lhqing opened 2 years ago

lhqing commented 2 years ago

Hi hisat-3n developers,

Thank you for building this wonderful aligner!

I am working on using hisat-3n to align all the single-cell DNA methylome and multiome data generated by the snmC-seq2 (DNA mC), snmCAT-seq (DNA mC + RNA + NOMe), and snm3C-seq (DNA mC + 3C) technologies in Ecker lab at the Salk Institute.

I did a benchmark between the hisat-3n-based pipeline (snakefile) with our previous bismark-based pipeline (snakefile). I notice two great benefits that made us willing to switch to hisat-3n to align our data

  1. The hisat-3n standard mode is much faster than bismark, and more memory efficient when using a higher number of threads to parallel.
  2. If a read pair can't be mapped in pairs, HISAT-3N will automatically try to map R1 R2 in SE mode (if I understand correctly), and retain whichever reads still get mapped. This is very convenient for our PBAT-based methylation methods and helps aligned more reads than simply running PE or DualSE with bismark.

However, I do have questions related to the --directional-mapping and --repeat mode of hisat-3n

1. Version and Example Files

Here are the commands for reproducibility:

The hisat-3n version I am using:

$ hisat-3n --version
/home/hanliu/pkg/hisat-3n/hisat2-align-s version 2.2.1-3n-0.0.2
64-bit
Built on cemba
Tue Jun 28 15:16:00 PDT 2022
Compiler: gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)
Options: -O3 -m64 -msse2 -funroll-loops -g3 -DPOPCNT_CAPABILITY -std=c++11
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

The hisat-3n-build command I used

# genome fasta file from UCSC, mm10, lambda DNA genome added as chrL
# for normal index
hisat-3n-build --base-change C,T -p 10 mm10_with_chrl.fa hisat-3n-dna/mm10_with_chrl

# for repeat index
hisat-3n-build --base-change C,T --repeat-index -p 10 mm10_with_chrl.fa hisat-3n-dna-repeat/mm10_with_chrl

If needed, you can download the FASTQ files and command outputs I mentioned below in this google drive link: https://drive.google.com/drive/folders/1RAJLsl_LQKfisJ5c1VBNWyaOl6t_NBX3?usp=sharing The FASTQ is from a mouse snmC-seq2 cell, I map it to mm10 genome.

2. About the --directional-mapping

When I try hisat-3n mapping in normal mode, the mapping rate is good

hisat-3n ./hisat-3n-dna/mm10_with_chrl -q \
-1 sample_cell.trimmed.R1.fq.gz -2 sample_cell.trimmed.R2.fq.gz \
--unique-only --base-change C,T \
--no-spliced-alignment --no-temp-splicesite \
-t --new-summary --summary-file hisat_3n_summary.normal.txt \
--threads 1 | samtools view -b -q 1 -o output.normal.bam

Multiseed full-index search: 00:01:37
HISAT2 summary stats:
        Total pairs: 50016
                Aligned concordantly or discordantly 0 time: 17784 (35.56%)
                Aligned concordantly 1 time: 29713 (59.41%)
                Aligned concordantly >1 times: 1656 (3.31%)
                Aligned discordantly 1 time: 863 (1.73%)
        Total unpaired reads: 35568
                Aligned 0 time: 24141 (67.87%)
                Aligned 1 time: 9149 (25.72%)
                Aligned >1 times: 2278 (6.40%)
        Overall alignment rate: 75.87%
Time searching: 00:01:39
Overall time: 00:01:51

However, when adding the --directional-mapping parameter, the mapping rate is close to 0.

hisat-3n ./hisat-3n-dna/mm10_with_chrl -q \
-1 sample_cell.trimmed.R1.fq.gz -2 sample_cell.trimmed.R2.fq.gz \
--unique-only --base-change C,T --no-spliced-alignment \
--no-temp-splicesite -t --new-summary --summary-file hisat_3n_summary.directional.txt \
--threads 1 --directional-mapping | samtools view -b -q 1 -o output.directional.bam

Multiseed full-index search: 00:00:55
HISAT2 summary stats:
        Total pairs: 50016
                Aligned concordantly or discordantly 0 time: 49699 (99.37%)
                Aligned concordantly 1 time: 261 (0.52%)
                Aligned concordantly >1 times: 42 (0.08%)
                Aligned discordantly 1 time: 14 (0.03%)
        Total unpaired reads: 99398
                Aligned 0 time: 98972 (99.57%)
                Aligned 1 time: 342 (0.34%)
                Aligned >1 times: 84 (0.08%)
        Overall alignment rate: 1.06%
Time searching: 00:00:57
Overall time: 00:02:01

My best guess is that the --directional-mapping only considers the original top strand (OT) and original bottom strand (OB) alignment, equivalent to Bismark directional mapping mode. However, our data is similar to PBAT library (R1), where R1 are supposed to be mapped to the strands complementary to OT (CTOT) and OB (CTOB), and R2 are supposed to be mapped to OT and OB. You can confirm this by running these two bismark commands, and bismark output will tell you which strand R1 and R2 map to:

bismark ./reference/with_chrl/ --bowtie2 sample_cell.trimmed.R1.fq.gz --pbat -o ./ --temp_dir ./

# part of the bismark output for R1:
Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT:  0       ((converted) top strand)
CT/GA:  0       ((converted) bottom strand)
GA/CT:  17046   (complementary to (converted) top strand)
GA/GA:  17242   (complementary to (converted) bottom strand)

bismark ./reference/with_chrl/ --bowtie2 sample_cell.trimmed.R2.fq.gz -o ./ --temp_dir ./

# part of the bismark output for R2
Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT:  13902   ((converted) top strand)
CT/GA:  13885   ((converted) bottom strand)
GA/CT:  0       (complementary to (converted) top strand)
GA/GA:  0       (complementary to (converted) bottom strand)

# or you can map the data in PE, need to add the --pbat parameter. 
# Besides, the overall mapping rate is lower in PE due to low R2 mapping rate
bismark ./reference/with_chrl/ --bowtie2 -1 sample_cell.trimmed.R1.fq.gz -2 sample_cell.trimmed.R2.fq.gz --pbat -o ./ --temp_dir ./

# part of the bismark output for PE mapping
Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT:   0   ((converted) top strand)
GA/CT/CT:   12015   (complementary to (converted) top strand)
GA/CT/GA:   11999   (complementary to (converted) bottom strand)
CT/GA/GA:   0   ((converted) bottom strand)

Bismark provides a --pbat parameter that allows mapping the PBAT library in directional mode, which is different from the default directional mode. In HISAT-3N, is there a way to have directional mapping flexible for different kinds of libraries?

3. About the multi-aligned reads

I also tried the --repeat mode of hisat-3n. I noticed the reads in the resulting BAM file only have three kinds of MAPQ scores (0, 1, 60). I wonder am I understand these three scores correctly? 0: reads not aligned, 1: reads aligned >1 time 60: reads unique aligned 1 time, including properly paired and discordantly paired

I noticed for the reads with MAPQ==1 will occur multiple times in the BAM file. When using the hisat-3n-table to generate counts for these reads, are you counting them multiple times at different genome locations?

Thank you for reading this rather long issue, I appreciate your time, and thanks again for this great tool!

Hanqing

imzhangyun commented 2 years ago

Hello Hanqing @lhqing,

Thank you very much for using HISAT-3N as your sequence aligner. Also, this is a great issue description and it is easy to repeat.

For 2: I map the reads you provided to GRCm39 and I got the same result as you. The reads can be mapped to the reference on normal mode, but only 1.07% overall alignment rate when using --directional-mapping. You are right because HISAT-3N assumes the original top strand (OT) and original bottom strand (OB) alignment. To solve this problem, we are developing a new version of HISAT-3N with could make the reverse directional mapping. Please check the hisat-3n-dev-directional-mapping-reverse branch. It has --directional-mapping option that works as before, and --directional-mapping-reverse that supports the PBAT library.

Here is my testing result:

$ ../../hisat-3n-dev-directional-mapping-reverse/hisat-3n \
-x ../../data/index/hisat-3n/GRCm39 \
-q -1 sample_cell.trimmed.R1.fq.gz -2 sample_cell.trimmed.R2.fq.gz \
--unique-only \
--base-change C,T \
--no-spliced-alignment \
--no-temp-splicesite \
-t --new-summary --summary-file hisat_3n_summary.directional.txt \
--threads 1 \
--directional-mapping-reverse \
-S leo_directional_reverse.sam

Multiseed full-index search: 00:00:24
HISAT2 summary stats:
    Total pairs: 50016
        Aligned concordantly or discordantly 0 time: 17831 (35.65%)
        Aligned concordantly 1 time: 29608 (59.20%)
        Aligned concordantly >1 times: 1665 (3.33%)
        Aligned discordantly 1 time: 912 (1.82%)
    Total unpaired reads: 35662
        Aligned 0 time: 24465 (68.60%)
        Aligned 1 time: 9045 (25.36%)
        Aligned >1 times: 2152 (6.03%)
    Overall alignment rate: 75.54%
Time searching: 00:00:25
Overall time: 00:00:30

This version of HISAT-3N can map your testing reads without any problem. We will merge it to the hisat-3n branch soon.

For 3: Yes, you are right. HISAT-3N/HISAT2 only output 3 types of MAPQ: 60 is for unique mapped. 1 is for multiple mapped. 0 is for unmapped.

When hisat-3n-table faces multiple mapped reads, the output table will count all positions it mapped.

Please let me know if you have any other questions or if you find any bugs in the developing version of HISAT-3N. Thanks again for using HISAT-3N.

Leo

lhqing commented 2 years ago

Hi Leo, @imzhangyun Thank you for the quick reply! I tried the hisat-3n-dev-directional-mapping-reverse branch, and it worked great for me, making hisat-3n even faster. I will let you know if I find any issues when running this version on larger datasets later. Hanqing