FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
386 stars 101 forks source link

Extracting reads aligned to a chromosome #345

Closed ajwije closed 4 years ago

ajwije commented 4 years ago

I am trying to split the bam file using samtools to get reads aligned to the chloroplast genome using the following command: samtools sort S1_combine.deduplicated.bam -o S1_combine.deduplicated.sort.bamsamtools samtools index S1_combine.deduplicated.sort.bam samtools view -b S1_combine.deduplicated.sort.bam chloroplast_1 > chloroplast.extraction.bam

However, the samtools crashes after printing weird characters to the screen. The bam file is an output from duplicate removed step from these two steps: bismark --non_directional $Genome_path -1 S1_combine_R1.fastq \ -2 S2_combine_R2.fastq -o $ALIGN

deduplicate_bismark -p *_bismark_bt2_pe.bam -o S1_combine

Any suggestions would be greatly appreciated.

FelixKrueger commented 4 years ago

Hmm, can't you simply do

samtools view -H  S1_combine.deduplicated.bam > header.txt
samtools view  S1_combine.deduplicated.bam | grep chloroplast > chloroplast_reads.sam
cat header.txt chloroplast_reads.sam | samtools -bS - >chloroplast.extraction.bam

Cheers, Felix

ajwije commented 4 years ago

Thanks, Felix for your quick reply. What you suggested worked and I was able to extract reads aligned to the chloroplast genome. Following is a report from the methylation calls. Is it accurate to say the bisulfite conversion didn't work?

chloroplast.extraction.bam

Parameters used to extract methylation information:
Bismark Extractor Version: v0.22.3
Bismark result file: paired-end (SAM format)
Output specified: strand-specific (default)
No overlapping methylation calls specified

Processed 10691 lines in total
Total number of methylation call strings processed: 21382

Final Cytosine Methylation Report
=================================
Total number of C's analysed:   213239

Total methylated C's in CpG context:    17380
Total methylated C's in CHG context:    14780
Total methylated C's in CHH context:    95012

Total C to T conversions in CpG context:    14073
Total C to T conversions in CHG context:    11432
Total C to T conversions in CHH context:    60562

C methylated in CpG context:    55.3%
C methylated in CHG context:    56.4%
C methylated in CHH context:    61.1%
FelixKrueger commented 4 years ago

Hmm, I am not an expert with plants - but isn't the chloroplast supposed to be unmethylated? .... Does your other data also look dodgy? How do the numbers compare to reports in the literature?

ajwije commented 4 years ago

Yes, according to literature, the chloroplast is supposed to be unmethylated. Other data also looks suspicious. Here is the splitting report for the entire genome. CHG and CHH methylations seem way too high.

For methylation calls, if understand correct, bismark uses 5x coverage by default, correct? is there a way to change it? I couldn't find the option (I am pretty sure I missed it!).

Processed 1073866 lines in total
Total number of methylation call strings processed: 2147732
Final Cytosine Methylation Report
=================================
Total number of C's analyzed:   14918674

Total methylated C's in CpG context:    1024675
Total methylated C's in CHG context:    960082
Total methylated C's in CHH context:    11441855

Total C to T conversions in CpG context:    109747
Total C to T conversions in CHG context:    139670
Total C to T conversions in CHH context:    1242645

C methylated in CpG context:    90.3%
C methylated in CHG context:    87.3%
C methylated in CHH context:    90.2% 
FelixKrueger commented 4 years ago

Bismark doesn't apply any cut-offs, so no worries there. But yes I agree, this looks like a quite failed bisuflite conversion, or possibly a few converted reads mixed with mostly unconverted genomic DNA. I am afraid this looks like you need to do it again...