CA and GA conflation - Githubissues

FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states

http://felixkrueger.github.io/Bismark/

GNU General Public License v3.0

CA and GA conflation #613

Closed anirudhjay closed 1 year ago

anirudhjay commented 1 year ago

Hi! I would like to understand how to decide whether a sequenced read is CT conflated or GA conflated. This matters to me because I want to count allele frequencies at particular locations in the genome. From looking at SAM files of Bismark alignments, it seems that reads with XG:GA are GA conflated while reads with XG:CT are CT conflated. Is this always the case? Can you let me know why this might be?

Thanks! Anirudh

FelixKrueger commented 1 year ago

I am not quite sure I understand the issue properly. Is there a chance you meant 'converted' instead of 'conflated'?

To determine the conversion state of a read, each read is converted in both a CT (top strand) and a GA (bottom strand) manner and aligned. If one of the alignments is best, this conversion state is chosen and recorded in the XG flag. Does that make it clearer?
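As a rough illustration of the two conversions described above, here is a minimal Python sketch. It is not Bismark's actual code; the function names `ct_convert` and `ga_convert` and the example read are made up for this example.

```python
def ct_convert(seq: str) -> str:
    """Replace every C with T (the top-strand, C-to-T conversion)."""
    return seq.upper().replace("C", "T")


def ga_convert(seq: str) -> str:
    """Replace every G with A (the bottom-strand, G-to-A conversion)."""
    return seq.upper().replace("G", "A")


# Hypothetical read, used only to show what each conversion does.
read = "ACGTCCGA"
print(ct_convert(read))  # ATGTTTGA
print(ga_convert(read))  # ACATCCAA
```

Both converted versions of a read would then be aligned, and the conversion whose alignment wins is the one reported for that read.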

anirudhjay commented 1 year ago

Hi Felix,

I apologize if I wasn't clear. Yes, I do mean converted. I used the term conflated because when I count allele frequencies at a particular position of a read, say in CT converted reads, I cannot distinguish whether a T is an actual nucleotide variant or a cytosine converted to a thymine by the bisulfite treatment. Hence, I collect them as a combined allele C_T, or in the other case G_A.

So, I just wanted to confirm that if I have a read with an XG:GA status, I will not be able to tell whether an A at position X is a nucleotide variant or a bisulfite-induced conversion (given that the reference sequence has a guanine at the same position).
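The combined-allele counting described above could be sketched roughly as follows. This assumes the XG value appears as an `XG:Z:CT` or `XG:Z:GA` optional field in the SAM record; the helper names `xg_tag` and `collapse_allele` and the example observations are hypothetical.

```python
from collections import Counter

# Bases that cannot be told apart on reads of each conversion type.
COLLAPSED = {
    "CT": {"C": "C_T", "T": "C_T"},
    "GA": {"G": "G_A", "A": "G_A"},
}


def xg_tag(sam_line: str):
    """Return the XG:Z value of a SAM record, or None if it is absent."""
    # Optional fields start at the 12th tab-separated column.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("XG:Z:"):
            return field[len("XG:Z:"):]
    return None


def collapse_allele(base: str, xg: str) -> str:
    """Map a base to its combined allele for the given XG value."""
    base = base.upper()
    return COLLAPSED.get(xg, {}).get(base, base)


# Made-up (base, XG) observations at a single genomic position.
observations = [("C", "CT"), ("T", "CT"), ("A", "CT"),
                ("A", "GA"), ("G", "GA"), ("T", "GA")]
counts = Counter(collapse_allele(base, xg) for base, xg in observations)
print(counts)
```

Bases that are not affected by the conversion (G and A on CT reads, C and T on GA reads) are counted as themselves.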

I hope this provides a bit more clarity.

FelixKrueger commented 1 year ago

Yes, that's correct. For a single read you cannot say whether a T at a C position reflects a methylation state or a nucleotide variant. In theory it is possible to identify nucleotide variants by looking at the opposing strand though: you would find an A if there was an SNV, but would still find a G if there is no mutation and you were looking at a methylation state. For this approach to work you will need enough reads at sufficient coverage though. Some tools to look at this are methylcoder or BisSNPer.
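A minimal sketch of the opposing-strand check, assuming bases are given in reference (forward-strand) orientation as in a SAM record, so that at a reference C position the XG:GA reads show C without a mutation and T with a C>T variant. The function name and both thresholds are arbitrary assumptions for illustration, not values taken from any tool.

```python
from collections import Counter


def classify_reference_c(ga_read_bases, min_depth=10, min_alt_fraction=0.2):
    """Classify a reference-C site using only bases from XG:GA reads.

    min_depth and min_alt_fraction are illustrative assumptions.
    """
    counts = Counter(base.upper() for base in ga_read_bases)
    depth = counts["C"] + counts["T"]
    if depth < min_depth:
        return "insufficient coverage"
    if counts["T"] / depth >= min_alt_fraction:
        return "likely C>T variant"
    return "no variant evidence"


print(classify_reference_c("CCCCCCCCCCCC"))  # no variant evidence
print(classify_reference_c("CCCCCCTTTTTT"))  # likely C>T variant
print(classify_reference_c("CT"))            # insufficient coverage
```

When the call is "no variant evidence", a T seen on XG:CT reads at the same position is more plausibly an unmethylated cytosine than a variant. The mirror-image check for a reference G would use the XG:CT reads.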

anirudhjay commented 1 year ago

Hi Felix,

Yes, as of now I am doing exactly what you proposed (looking at CT converted strands for G and A variants, and GA converted strands for C and T variants). Thanks for the suggestions!

Best, Anirudh