DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0
464 stars 113 forks source link

converting hisat 3N-conversion-table output to CpG methylation counts #318

Closed mjbyun closed 2 years ago

mjbyun commented 2 years ago

First of all, thank you for developing hisat 3N. As a previous bismark user, I am excited by the rapid alignment capability of hisat 3N and its high mapping efficiency.

Using whole genome BS seq data (human), I have performed the alignment (standard index), deduplication (using samtools), and generated 3N-conversion-table which all worked seamlessly. However, I am stuck at this step as I am unsure how to convert the output of the 3N-conversion-table to a format similar to the bismark coverage file. Basically, I need to extract the number of methylated and unmethylated cytosines at each CpG site throughout the genome (plus the coordinate information), with the ultimate goal of creating a BSseq object. Do you have any suggestions as to how to go about this?

Also a related question: the hisat-3n-table manual indicates that using the --CG-only option will only count the CpG island in the reference genome. How do you define a CpG island in your code? Is it possible that --CG-only counts all CpG sites regardless of whether they are located within a CpG island or not?

imzhangyun commented 2 years ago

Hello Minji,

Thank you for using HISAT-3N. Sorry for the inaccurate manual, the--CG-only option will count all the CpG sites rather than the CpG islands. We just updated the HISAT-3N manual.

The 3N-conversion-table contain similar information as Bismarck methylation extractor output. The coordinate information is provided by column "ref" and column "pos". For BS-seq, the number of methylation sites is provided by the column "unconvertedBaseCount" and the number of unmethylation sites is provided by the column "convertedBaseCount". Please notice that some other sequencing technology (e. g. TAPS) converted methylated C to T, so the methylation sites counting is provided by the column "convertedBaseCount".

Best, Leo

mjbyun commented 2 years ago

Thank you Leo for the rapid response. It is very helpful. Good to know the --CG-only counts all CpGs. That makes my job much easier!

Your explanation makes sense (that uncoverted and converted base count correspond correspond to the number of methylated and unmethylated sites, respectively, for BS-seq). Meanwhile, how much does one need to pay attention to the converted/unconvertedBaseQualities? I am not sure what F, B, <, / signs mean and how to interpret them.

imzhangyun commented 2 years ago

Hello Minji,

For the quality score, you can get more information from SAM format specification page 6, Col 11 (Field: QUAL). Basically, each character in this field represent the sequencing quality of one base. We add the "converted/unconvertedBaseQualities" column to match the BSmooth output. Downstream software can adjust the "convertedBaseCount" by the "convertedBaseQualities". As I know, many analysis tools for BS-seq does not use the quality information. I hope this information is helpful.

Best, Leo

mjbyun commented 2 years ago

This is super helpful. Thanks so much, Leo!

mjbyun commented 2 years ago

Hello Leo,

I reopend this issue as I discovered running hisat-3n-table with --CG-only option returns much fewer CpG sites than expected. I am working with human whole genome bisulfite sequencing data at 25x coverage (paired end, Illumina 150bp reads). Here is the samtools flagstat output for my bam file:

$ samtools flagstat -@4 MB1.markdup.bam 571679250 + 0 in total (QC-passed reads + QC-failed reads) 533617494 + 0 primary 38061756 + 0 secondary 0 + 0 supplementary 132261855 + 0 duplicates 132261855 + 0 primary duplicates 523564336 + 0 mapped (91.58% : N/A) 490380439 + 0 primary mapped (91.90% : N/A) 518641574 + 0 paired in sequencing 259320787 + 0 read1 259320787 + 0 read2 456454446 + 0 properly paired (88.01% : N/A) 462119304 + 0 with itself and mate mapped 28261135 + 0 singletons (5.45% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)

Here is my command for running hisat-3n-table: $ samtools view -h --exclude-flags 1024 -@4 MB1.markdup.bam |\ hisat-3n-table -p 4 --ref hg38-canonical.fa --alignments - --output-name MB1.CpG.tsv --base-change C,T --unique-only --CG-only

The program completes without emitting an error message. However, the MB1.CpG.tsv file contains ~1 million lines/CpG sites (all chromosomes are included), whereas, based on my previous experience at this depth of coverage (25x), >20 million CpG sites are expected. Do you have any explanation? I wonder whether it has something to do with how --CG-only function interact with the reference file? FYI, I am using hg38/GRCh38 assembly.

imzhangyun commented 2 years ago

Hello Minji,

Sorry for the bug again. I just confirmed that we have bug for the --CG-only option. It will take me about 2 days to fix it.

Leo

imzhangyun commented 2 years ago

Hello Minji,

I believe I just fixed the --CG-only bug for hisat-3n-table. Could you pull the hisat-3n code and make it again?

Thanks, Leo

mjbyun commented 2 years ago

Thanks a lot for looking into this issue quickly! I'll try gain and let you know the outcome.

mjbyun commented 2 years ago

--CG-only works! Thank so much.

zhangzhiyangcs commented 2 years ago

Hi Minji, I am also user of bismark and bsmap. But there are different formats compared to hisat-3n table. I want to utilize the downstreams scripts of bsmap or bismark to get result of CHG, CHH sites and correponding statistical test. How do you deal with it? Could you give me some suggestions. Thanks a lot.