bio15anu / revelio

A lightweight utility for BS-seq and MethylC-seq data which applies a double-masking procedure on bisulfite alignments, facilitating variant calling with conventional software.
MIT License
4 stars 4 forks source link

Work with Bismark? #1

Closed rbpisupati closed 1 year ago

rbpisupati commented 2 years ago

Hello!

This package/script is great considering how poor SNP calling algorithms are on BS-seq libraries. Just wondering, does this also work on Bismark generated aligned bam files? Also, I am assuming this works on both directional or non-directional bisulfite libraries.

I am currently using BCFtools after modifying the aligned BAM file (using python script) and do not end up with any alternative allele calls on bismark generated bam files? Can provide you the exact commands I am using if needed.

Cheers, R

bio15anu commented 2 years ago

Hi, thanks for posting! Yes there is no reason it shouldn't work on Bismark alignment files, unless you are perhaps using the MD tags generated by default instead of redoing them with e.g. samtools calmd? Alternatively, you can give the genome fasta file and skip the MD tag usage.

Otherwise, to rule out any issue with BCFtools does the problem persist when you use alternative variant caller such as Freebayes?

rbpisupati commented 2 years ago

Thank you very much for your response!

I analysed bam files in two ways. First, provide reference genome fasta along with mapped bam file to your script. Second, perform samtools calmd on the aligned bam files and then use your script to modify the bam files.

I think the files which are newly sequenced (non-directional) have bad SNP calling compared to old libraries (directional).

I do not think issue is with BCFtools, as I am just doing genotype calling at provided target loci. Much simpler than standard SNP calling.

rbpisupati commented 2 years ago

I am happy to share mapped bam files from bismark, if you want to run some diagnostics.

bio15anu commented 2 years ago

Something doesn't add up here. More modern libraries (e.g. MethylC-seq) are directional, whereas older BS-seq libraries non-directional, but in any case once they are aligned it should not make a difference unless something has gone wrong during the alignment. Are you certain everything is working correctly during this step? What do your mapping statistics look like?

rbpisupati commented 2 years ago

Yes I think the bismark alignment was fine. Here is the alignment report

Bismark report for: H7MWJDSX3_1#188467_AATCCAATTGCATGTAGAGG_1_val_1.fq.gz and H
7MWJDSX3_1#188467_AATCCAATTGCATGTAGAGG_2_val_2.fq.gz (version: v0.23.0)
Bismark was run with Bowtie 2 against the bisulfite genome of TAIR10_wholeGenome/ with the specified options: -q --score-min L,0,0.5 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500
Option '--non_directional' specified: alignments to all strands were being performed (OT, OB, CTOT, CTOB)

Final Alignment report
======================
Sequence pairs analysed in total:   7658821
Number of paired-end alignments with a unique best hit: 2518193
Mapping efficiency: 32.9% 
Sequence pairs with no alignments under any condition:  4350420
Sequence pairs did not map uniquely:    790208
Sequence pairs which were discarded because genomic sequence could not be extracted:    0

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT:   651956  ((converted) top strand)
GA/CT/CT:   584233  (complementary to (converted) top strand)
GA/CT/GA:   603959  (complementary to (converted) bottom strand)
CT/GA/GA:   678045  ((converted) bottom strand)

Final Cytosine Methylation Report
=================================
Total number of C's analysed:   93145547

Total methylated C's in CpG context:    2561668
Total methylated C's in CHG context:    1274057
Total methylated C's in CHH context:    5095582
Total methylated C's in Unknown context:    2

Total unmethylated C's in CpG context:  10967697
Total unmethylated C's in CHG context:  12153405
Total unmethylated C's in CHH context:  61093138
Total unmethylated C's in Unknown context:  0

C methylated in CpG context:    18.9%
C methylated in CHG context:    9.5%
C methylated in CHH context:    7.7%
C methylated in unknown context (CN or CHN):    100.0%

Bismark completed in 0d 0h 15m 18s
bio15anu commented 2 years ago

Okay, the mapping efficiency seems quite low, but how does this compare with the directional libraries? If this is WGBS and assuming 150 bp reads then I'm estimating something like 3X coverage on Arabidopsis genome, or am I mistaken? That doesn't seem enough for what you are looking to do.

rbpisupati commented 2 years ago

Yes, you are right! I realised this is much lower than previous runs (below is for the old library). Shall have to check that with the facility again.

But not sure, why SNP calling would be affected here though (since it should be only mapped reads that matter).

Bismark report for: HHHY7DSX2_2#169198_TACGCTGCTAGATCGC_1_val_1.fq.gz and HHHY7DSX2_2#169198_TACGCTGCTAGATCGC_2_val_2.fq.gz (version: v0.23.0)
Bismark was run with Bowtie 2 against the bisulfite genome of TAIR10_wholeGenome/ with the specified options: -q --score-min L,0,-0.5 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500
Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed)

Final Alignment report
======================
Sequence pairs analysed in total:   3054311
Number of paired-end alignments with a unique best hit: 2185442
Mapping efficiency: 71.6% 
Sequence pairs with no alignments under any condition:  251526
Sequence pairs did not map uniquely:    617343
Sequence pairs which were discarded because genomic sequence could not be extracted:    14

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT:   1083797 ((converted) top strand)
GA/CT/CT:   0   (complementary to (converted) top strand)
GA/CT/GA:   0   (complementary to (converted) bottom strand)
CT/GA/GA:   1101631 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report
=================================
Total number of C's analysed:   95072549

Total methylated C's in CpG context:    3371890
Total methylated C's in CHG context:    890011
Total methylated C's in CHH context:    1115687
Total methylated C's in Unknown context:    3628

Total unmethylated C's in CpG context:  10146802
Total unmethylated C's in CHG context:  12628506
Total unmethylated C's in CHH context:  66919653
Total unmethylated C's in Unknown context:  59830

C methylated in CpG context:    24.9%
C methylated in CHG context:    6.6%
C methylated in CHH context:    1.6%
C methylated in unknown context (CN or CHN):    5.7%

Bismark completed in 0d 0h 9m 37s

Thank you for your help though :) much appreciate

bio15anu commented 2 years ago

No worries, glad to help. Yes I agree it's strange, as it also doesn't seem to me like enough coverage for mapped reads in the directional libraries either.

Since you mentioned you are looking at pre-selected target loci, could you estimate the coverage in each case over those regions specifically? Perhaps the sequencing depth distribution is such that those regions just happen to be adequately covered in the directional library, whereas not in the non-directional library? If it's below the range of 10-20X then you are unlikely to get good results.

bio15anu commented 1 year ago

I am closing this now as I think the issue is stale. Please let me know if there is anything else to be done.