bio15anu / revelio

A lightweight utility for BS-seq and MethylC-seq data which applies a double-masking procedure on bisulfite alignments, facilitating variant calling with conventional software.
MIT License
4 stars 4 forks source link

Allele frequencies skewed by masking #5

Open lnblum opened 4 months ago

lnblum commented 4 months ago

Hello,

I've been using your Revelio with Freebayes to call variants on methylation libraries. We were able to get good agreement between our methylated libraries and benchmarking variant dataset. I'm trying to understand the implications for calculating allele frequencies. I've attached an example of a C->T mutation below, that is heterozygous with roughly half-and-half allele frequencies in the truth set. We call it correctly as a heterozygous C->T, with counts of 3 (alt) and 24 (reference) with an allele frequency of ~0.1. This is because the C's on the positive strand are not masked but the Ts are. In this case it seems that it makes sense to calculate the allele frequency only with the "negative" alignments.

I'm just curious if you considered these types of scenarios in your development and what reasoning you came to. Thank you! Laura

Screenshot 2024-04-22 at 12 12 54 PM
bio15anu commented 4 months ago

Hi Laura,

Thanks for your interest in Revelio. The main aspects we considered when testing the method were the implications of moving from a high information (standard DNA library) to a low information (masked bisulfite) system, and the effect of strand bias in C->T mutation contexts. To preface: no matter what we do, we will never be as good with bisulfite data as with a standard DNA library, just by nature of having less information in the system to work with. Our goal was always to develop a heuristic which is "good enough" to justify not having to sequence both a standard and a bisulfite library in parallel just for the purpose of calling variants alongside methylation data.

What you have here is an example that we have considered in terms of strand bias. When REF and ALT alleles are not balanced on the forward and negative strands, you might consider it an error if a heterozygous C->T SNP is on the threshold where after masking it is not called (i.e. read as homozygous REF) because the majority of its ALT alleles were located on the forward strand (and thus masked). However, when compared to an exactly equivalent standard dataset, which controls for other library differences (sequencing depth, mapping and so on), we tend to observe that "false negative" calls driven by strand bias in this context are often already suspicious before masking. So yes, the masking can exacerbate existing issues with strand bias, but there is also some ambiguity over the validity of including such loci in the truth set to begin with. Are you controlling for such library differences with your benchmarking variant dataset?

In regard to the suggestion to recalculate AF based on only the negative strand, I would not advocate to include this as part of the Revelio method itself. Instead I would advocate to consider applying this perhaps on a case-by-case basis in the post-processing or analysis step, once the user has obtained their results and identified the variants to focus on further in their study? One problem I see is that strand bias may also occur in the opposite direction. Consider an example where the majority of REF alleles are on the forward strand, and the majority of ALT alleles on the negative. Such a case should be a heterozygous C->T in a standard library (although again suspicious to begin with), but following masking if you count again only the negative strand you would potentially call it homozygous ALT (i.e. a false positive). For every example like that which you have presented, where you might "rescue" the AF somewhat and potentially avoid a false negative, the trade-off is that you might also create a new false positive in its place.

For this reason, for the method itself I prefer to retain as much information in the system as possible, and to instead reconsider what AF means in a bisulfite library (particularly in C->T context) before applying standard filtering thresholds, and so on. You can also mitigate any potential issues with low frequency alleles close to the threshold of being a false negative, that are otherwise not subject to strand bias, by increasing the overall sequencing depth relative to a standard DNA library. For example if you would normally sequence 30x in a standard library for variant calling, then try 40x for bisulfite. We discuss this to some extent in the publication.

Hope this helps! Please don't hesitate to ask if you have further questions.

lnblum commented 4 months ago

Hi,

Thanks for your thorough response!

I have noticed incidents of strand bias, and I can see how with this method there is not much you can do to 'fix' data in those cases. Now I also want to try to compare this in standard DNA libraries to see if the strand bias appears for certain variants there too.

In the example I shared, though, I don't see the main issue is strand bias. The allele frequency is thrown off because all the C's in the positive alignments are counted but the T's (some of which could be unmenthylated Cs or C>T mutations), are ignored since we don't know their source. The observations are accurate, but you can't extend that to a proportion or allele frequency since we are masking part of the population among the positive strand alignments. By using the negative alignments to calculate allele frequency, you would be closer to representative since none of those bases are masked. Strand bias would still be a problem in cases where that exists.

If I'm understanding right, Revelio is designed to maximize the information kept, which I think is totally valid. I'm just highlighting that there may be implications for calculating allele frequency, and in particular, you may not accurately be able to calculate an allele frequency without further post-processing. Do you agree?

Thanks, Laura

bio15anu commented 4 months ago

Yes I agree the allele frequency on single samples can be skewed, but as reasoned in the previous post, from my perspective this is only an issue for Revelio if it results for example in a false negative call (i.e. a heterozygous C->T SNP is interpreted wrongly as a homozygous REF by Freebayes). Undoubtedly this can happen, but typically seems driven by pre-existing strand bias or because we effectively lose some sequencing depth when masking, relative to an otherwise exactly equivalent standard library.

If it is important in your analysis to report the unskewed AF (for single samples) in this substitution context, I would advocate to make any such suggested adjustments as a post-processing procedure (rather than a change to Revelio), for the reasoning given in the previous post.

I think there is validity to the approach if you tend more to think of it as some kind of normalisation technique, but I would also emphasise some caution in that you consider variants for adjustment on more of a case-by-case basis, rather than simply applying a blanket recalculation on all variants based on the negative strand. While you may bring a decent majority into a range which is closer to expected, it will potentially also introduce errors on variants which are influenced by strand bias, like those mentioned in the previous post. Since the bisulfite treatment dictates that we cannot really know the extent of strand bias from the data alone, it remains unclear to me how one could really measure the difference in accuracy. I think you would need two exactly equivalent datasets, one bisulfite and one standard DNA, and in our case at least the only way I could think to achieve this was with simulated data.

Cheers, Adam