FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
379 stars 101 forks source link

Indexing and alignments - Variants Called #366

Closed MMG9 closed 3 years ago

MMG9 commented 4 years ago

We have are performing amplicon Next Gen sequencing of Bisulfite converted DNA. We are interested in the methylation status at the CpG sites in our region of interest. For our analysis, we first perform QC, then align the sequences in Bismark with HISAT2. We then use Samtools/BCFtools to perform mpileup and generate a VCF file. I'm curious about the resulting data in the columns RefF, RefR, AltF, AltR. My initial understanding of these was the following:

Starting sequence - wild type not converted Strand A Strand B (reverse complement to strand A)

Bisulfite convert DNA

RefF = bisulfite converted strand A with full methylation of CpGs. RefR= Bisulfite converted strand A with no methylation at CpG sites, or alternate allele AltF = bisulfite converted strand B with full methylation of CpGs. AltR= Bisulfite converted strand B with no methylation at CpG sites, or alternate allele

However, I just had a set of data where part way through the region of interested, the variants switches columns. Here it seems as though the variants are being read from the forward strand (or read 1) and then halfway through switches to call variants from the reverse strand (or read 2). I understand why this would happen, but the columns the data are reported in do not align with what I was originally thinking the columns meant. Could you provide some further clarification? I've attached an example of what I'm referring to.

Capture (2)

Thank you in advance for any clarification you can provide.

FelixKrueger commented 4 years ago

It's a bit tricky to give you an answer here without having data right in front of me, but I think I've got an idea. Before I get into this I wanted to quickly ask:

Is there a specific reason why you want to go a let's say quite non-standard way of using mpileup and variant calling instead of just looking the methylation levels?

If you were to follow up the Bismark run with bismark_methylation_extractor --bedGraph --gzip mapped_file.bam you would get a Bismark coverage file with all the information you need:

<chromosome> <start position> <end position> <methylation percentage> <count methylated> <count unmethylated>

This file could also be imported into SeqMonk where you could then inspect your data visually, and perform all sorts of quanitations on it (rather than looking at VCF files in Excel...).

Coming back to your original variant calling format:

Bisulfite sequencing will take the top or bottom strands of your DNA, and chemically convert it. So RefF and AltF (forward) should together add up to all the reads that align to the top strand. RefF would mean a methylated residue (e.g. C), AltF would be unmethylated (i.e. T). Similarly, RefR and AltR are all the residues should add up to the total number of reads covering reverse strand reads.

Now I think what pileup might be doing 'wrong' here is that it might simply count both reads coming from top and bottom reads as if they were both present and add the bases to the tally. In methylation calling terms this is not the right to go about this, as you can only use reads to the top strand (OT or CTOT) for top strand calling, and reads aligning to the bottom strands (OB and CTOB) for reverse strand calling. Not sure how or whether this was at all handled in the mpileup command? In addition to that, some amplicon alignments require --non_directional alignments, did you use those? In that case I wouldn't be surprise if the mpileup command would in addition use the FLAG values in the BAM file to determine strands (which it would almost certainly get wrong).

I think what I am saying is: if I were you I would not touch a variant calling approach for this if you can at all avoid it. The Bismark/SeqMonk approach would within say 10 minutes give you all the answers you need for this. I hope this helped?

FelixKrueger commented 3 years ago

Closing this as there was no further interaction.