Bismark_methylation_extrcator coverage file

shaghayeghsoudi commented 1 year ago

Hi Felix,

Thanks again for awesome softwares. I just proceed with my Diagenode data according to your recommendations and called methylation. Sorry if it is a naive question but I thought If I can ask your opinion about that (thanks a lot in advance and sorry for keep asking questions). I have 150 bp paired end reads and aligned them on --pbat mode using Bismark. Then for the methylation extraction I used the following command:

bismark_methylation_extractor -p ${BAM} --genome ${REF_DIR} -o ${OUTPUT_DIR} --bedGraph --counts --comprehensive --no_overlap

I am getting all the files I should get (*.cov, bedGraph, etc) and my coverage file from "deduplicated Bam files" look like this:

head SRC160-T1_R1_val_1_bismark_bt2_pe.deduplicated.bismark.cov
1   10498   10498   0   0   1
1   10526   10526   100 1   0
1   10543   10543   100 1   0
1   10564   10564   0   0   1
1   10572   10572   0   0   1
1   10578   10578   0   0   1

As you see very low coverage! I also visualized my bam files using IGV. So using deduplicated Bam files seems weird with very low coverage.

igv_snapshot

SRC127-N7_R1_val_1.fq.gz FastQC Report.pdf SRC161-N_R1_val_1.fq.gz FastQC Report.pdf

I used the following Bismark command for the alignment bismark --pbat -1 ${SAMPLE_R1} -2 ${SAMPLE_R2} --bowtie2 --bam --temp_dir ${TEMP_DIR} --genome ${REF_DIR} -o ${OUTPUT_DIR} --un

I should mention I get at least 70% mapping efficiency for 90% of my samples and here is one example [SRC161-N_R1_val_1.fq.gz FastQC Report.pdf](https://github SRC127-N7_R1_val_1.fq.gz FastQC Report.pdf .com/FelixKrueger/Bismark/files/12865149/SRC161-N_R1_val_1.fq.gz.FastQC.Report.pdf)

Sequence pairs analysed in total:   44599492
Number of paired-end alignments with a unique best hit: 31242107
Mapping efficiency: 70.1% 
Sequence pairs with no alignments under any condition:  5120832
Sequence pairs did not map uniquely:    8236553
Sequence pairs which were discarded because genomic sequence could not be extracted:    1

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT:   0   ((converted) top strand)
GA/CT/CT:   15485693    (complementary to (converted) top strand)
GA/CT/GA:   15756413    (complementary to (converted) bottom strand)
CT/GA/GA:   0   ((converted) bottom strand)

Final Cytosine Methylation Report
=================================
Total number of C's analysed:   1717840085

Total methylated C's in CpG context:    161154477
Total methylated C's in CHG context:    7736429
Total methylated C's in CHH context:    14267469
Total methylated C's in Unknown context:    14412

Total unmethylated C's in CpG context:  109491259
Total unmethylated C's in CHG context:  435737295
Total unmethylated C's in CHH context:  989453156
Total unmethylated C's in Unknown context:  584751

C methylated in CpG context:    59.5%
C methylated in CHG context:    1.7%
C methylated in CHH context:    1.4%
C methylated in Unknown context (CN or CHN):    2.4%

I really appreciate of If could get your opinion on this issue just to make sure at least I am not missing anything related to using the softwares efficiently. Thank you

FelixKrueger commented 1 year ago

If I remember correctly then your data is a RRBS, is that correct? RRBS typically works by digesting the genome using MspI which results in a number of fragments (maybe something like 600-800K), so you are expected to sequence the very same fragments over and over, which a function of sequencing depth. In other words: it is not recommended apply deduplication on RRBS data as you will lose the majority of your data. The exception to this rule would be if you used UMIs to allow the distinction between genuine fragments (coming from different cells) and PCR duplicates (which would all carry the same UMI).

Here is an RRBS Guide that explains a few more things (hopefully).

shaghayeghsoudi commented 1 year ago

Thanks Felix, Yes. That makes sense.

FelixKrueger / Bismark

Bismark_methylation_extrcator coverage file #631