FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
382 stars 101 forks source link

[Questions] Loci present in the bedgraph file and coveage file are not present in cytosine_report #632

Closed VicSkuldy closed 11 months ago

VicSkuldy commented 11 months ago

Hi, when I run Bismark methylation extraction step, I have found a locus: chr10 131265805 , I can find 2 reads of this site in CpGcontext*_bismark_bt2_pe.txt.gz:

C00116:162:HHYJHDRX3:1:1221:27335:8907_1:N:0:AGCGAGAT+GATACTGG  -       chr10   131265805       z
C00116:162:HHYJHDRX3:1:1221:27263:8970_1:N:0:AGCGAGAT+GATACTGG  -       chr10   131265805       z

Also, I can fine this site in bedgraph and coverage output file: bedgraph output:

chr10   131265804       131265805       0

coverage output:

chr10   131265805       131265805       0       0       2

However, I can not find information about this site in final cytosine_report: ${sample}_bismark_bt2_pe.CpG_report.txt.gz I feel strange that in my understanding, according to https://felixkrueger.github.io/Bismark/options/methylation_extraction/ , after the conversion to bedGraph has completed, the option --cytosine_report produces a genome-wide methylation report for all cytosines in the genome. The output considers all Cs on both forward and reverse strands regardless of its reads number. So why does this site not present in ${sample}_bismark_bt2_pe.CpG_report.txt.gz? Perhaps I misunderstood something? My reference genome is hg19 and this is my command line:

bismark_methylation_extractor -p --gzip --bedGraph --CX --buffer_size 10G --cytosine_report --comprehensive --genome_folder ${ref_path} ${bam_path}/${sample}.deduplicated.bam -o ${out_path}

Looking forward to your kind reply!

FelixKrueger commented 11 months ago

My guess is that you found this position in the methylation extraction process due to a deletion just downstream of the at position 805 which results in a context change. Since the cytosine report strictly uses the genome reference sequence to determine the context, you may not find the position present in the cytosine report. I have written up a little FAQ on this here: https://felixkrueger.github.io/Bismark/faq/context_change/

Since the position in questions appears to be a stretch of multiple Cs in a row I wouldn't be surprised to find a 1bp deletion in the two reads you linked above. (you could test this by grepping for the header line C00116:162:HHYJHDRX3:1:1221:27335:8907 in the BAM files, and I would predict you'll be able to find 1D somewhere in the CIGAR string...