Strand-bias: only apply if distribution is opposite?

andreas-wilm commented 8 years ago

Matters esp. for high-coverage data, where huge numbers make any diff look significant. Implement and benchmark first.

rspreafico-vir commented 4 years ago

This is indeed pretty important. I ran into some variants myself that were filtered out because of strand bias that didn't look that extreme. I noticed strand bias may occur more easily at the two ends of an amplicon. What would be the best way to mitigate this? I think getting rid of strand bias filters completely would be too much, but as you pointed out, maybe it would be best not to be too sensitive with high-coverage data.

andreas-wilm commented 4 years ago

It's difficult to generalize here. We've also observed more strand bias toward the ends of amplicons. Strand bias is generally higher in highly PCRed data, and it correlates strongly with non-reproducible variants, e.g. in replicates. I guess we would need different default settings for different types of data (exome, WGS, amplicon). I'd love to leave it up to the user though, because these settings are always just an approximation.

rspreafico-vir commented 4 years ago

Yes, that makes sense. One point is that the Fisher test becomes increasingly sensitive to smaller and smaller differences as coverage increases. If I am not mistaken, there is an additional threshold for the absolute % of reads on either strand. At amplicon ends, where the nature of the assay makes one of the two strands more dominant (regardless of REF and ALT), that threshold is often met. However, it is met for both REF and ALT. At that point Fisher kicks in, and if it's high-coverage data, the variant gets killed.
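To illustrate the coverage effect, here is a minimal sketch of a two-sided Fisher exact test written with the standard library only. It is not LoFreq's implementation, and the function names and read counts are made up for illustration. Both tables have the same strand proportions (REF 80% forward, ALT 75% forward); only the depth differs.

```python
from math import exp, lgamma


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are REF/ALT, columns are forward/reverse read counts.
    Sums the probabilities of all tables that are no more likely
    than the observed one, with the margins held fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    log_denom = _log_comb(row1 + row2, col1)

    def log_prob(x):
        # Hypergeometric log-probability of seeing x in the top-left cell.
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - log_denom

    log_obs = log_prob(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    p = 0.0
    for x in range(lo, hi + 1):
        lp = log_prob(x)
        if lp <= log_obs + 1e-7:  # small tolerance for rounding error
            p += exp(lp)
    return min(1.0, p)


# Hypothetical counts: identical strand proportions, 1000x difference in depth.
p_low = fisher_exact_two_sided(144, 36, 15, 5)
p_high = fisher_exact_two_sided(144_000, 36_000, 15_000, 5_000)
print(f"depth 200:     p = {p_low:.3g}")
print(f"depth 200,000: p = {p_high:.3g}")
```

The 5-point difference in forward-strand fraction is unremarkable at depth 200 but becomes overwhelmingly significant at depth 200,000, which is the behaviour described above.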

Makes me wonder whether it would be possible to build the significance test for strand bias so that it checks whether the difference between REF and ALT is greater than X, as opposed to just being different from zero. It's the same reasoning as for DEGs in RNA-seq, where one checks whether logFC is greater than X rather than just different from zero. This would protect against strand imbalance affecting both REF and ALT in high-coverage data.
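One possible way to sketch such a "greater than X" test is with a normal-approximation (Wald) confidence interval on the difference in forward-strand fraction: flag the variant only if the whole interval lies beyond ±X. This is an assumption about how the idea could be realized, not something LoFreq does; the function name, `min_diff`, and `alpha` are hypothetical, and the Wald interval is known to be unreliable at low counts.

```python
from math import sqrt
from statistics import NormalDist


def strand_diff_exceeds(ref_fw, ref_rv, alt_fw, alt_rv,
                        min_diff=0.10, alpha=0.001):
    """Return True if the REF/ALT strand difference is confidently > min_diff.

    Builds a (1 - alpha) Wald confidence interval for
    (ALT forward fraction - REF forward fraction) and requires the
    entire interval to lie outside [-min_diff, +min_diff].
    """
    n_ref, n_alt = ref_fw + ref_rv, alt_fw + alt_rv
    if n_ref == 0 or n_alt == 0:
        return False  # nothing to compare against (illustrative choice)
    p_ref, p_alt = ref_fw / n_ref, alt_fw / n_alt
    diff = p_alt - p_ref
    se = sqrt(p_ref * (1 - p_ref) / n_ref + p_alt * (1 - p_alt) / n_alt)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = diff - z * se, diff + z * se
    return lower > min_diff or upper < -min_diff


# Hypothetical counts: both alleles skewed forward, 5-point difference.
print(strand_diff_exceeds(144_000, 36_000, 15_000, 5_000))
# Hypothetical counts: ALT almost entirely on the reverse strand.
print(strand_diff_exceeds(144_000, 36_000, 2_000, 18_000))
```

With a threshold of this kind, extra depth only narrows the interval around the observed difference; it cannot turn a small shared imbalance into a call of strand bias.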

The simpler alternative would be to keep the Fisher test as-is, and just add an absolute % threshold of difference between REF and ALT in strand usage (taking into account AF).

andreas-wilm commented 4 years ago

Yes, you are right. As coverage increases the Fisher test becomes very sensitive. We have an ad-hoc solution in lofreq filter, called compound filter:

$ lofreq filter
...
Strand Bias (SB):
  Note, variants are only filtered if their SB pvalue is below the threshold
  AND 85% of variant bases are on one strand (toggled with --sb-no-compound).
 ...

While this addresses some of the issues, it's a bit arbitrary. Open to suggestions here.
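The rule in the help text above can be sketched as follows. This is only an illustration of the stated logic (p-value below threshold AND 85% of variant bases on one strand), not the actual `lofreq filter` code; the function name, argument names, and the default `alpha` are assumptions, and the p-value is taken as a plain probability.

```python
def compound_sb_filter(sb_pvalue, alt_fw, alt_rv,
                       alpha=0.001, strand_frac=0.85):
    """Return True if the variant would be filtered for strand bias.

    Both conditions must hold:
      1. the strand-bias p-value is below alpha, and
      2. at least strand_frac of the variant bases sit on one strand.
    """
    n_alt = alt_fw + alt_rv
    if n_alt == 0:
        return False  # no variant bases, nothing to judge
    dominant_frac = max(alt_fw, alt_rv) / n_alt
    return sb_pvalue < alpha and dominant_frac >= strand_frac


# Hypothetical high-coverage variant, significant p-value, 75% forward: kept.
print(compound_sb_filter(1e-60, 15_000, 5_000))
# Same p-value, 95% forward: filtered.
print(compound_sb_filter(1e-60, 19_000, 1_000))
```

Note that the second condition looks only at the variant bases, so a variant at an amplicon end where REF is just as skewed as ALT can still be filtered.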

rspreafico-vir commented 4 years ago

Thanks @andreas-wilm! I think the compound filter is key. How about making that even more flexible with a couple of changes?

- Testing the percent strand difference between reference and alternate bases, as opposed to the percent strand of variant bases
- Allowing that percent difference to be set dynamically, as a parameter

Please let me know your thoughts.
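A minimal sketch of the two proposed changes, assuming the strand-bias p-value is computed elsewhere: the second condition compares the forward-strand fraction of ALT against that of REF, and the cutoff is a parameter. The function name, `min_strand_diff`, and the defaults are hypothetical and do not correspond to any existing LoFreq option.

```python
def compound_sb_filter_diff(sb_pvalue, ref_fw, ref_rv, alt_fw, alt_rv,
                            alpha=0.001, min_strand_diff=0.20):
    """Return True if the variant would be filtered for strand bias.

    Both conditions must hold:
      1. the strand-bias p-value is below alpha, and
      2. the forward-strand fractions of REF and ALT differ by at
         least min_strand_diff (user-settable).
    """
    n_ref, n_alt = ref_fw + ref_rv, alt_fw + alt_rv
    if n_ref == 0 or n_alt == 0:
        # No REF (or no ALT) reads to compare against; a real
        # implementation would need a fallback rule here.
        return False
    diff = abs(alt_fw / n_alt - ref_fw / n_ref)
    return sb_pvalue < alpha and diff >= min_strand_diff


# Hypothetical amplicon-end site: REF 90% forward, ALT 88% forward: kept.
print(compound_sb_filter_diff(1e-60, 162_000, 18_000, 17_600, 2_400))
# Hypothetical site: REF balanced, ALT 95% forward: filtered.
print(compound_sb_filter_diff(1e-60, 90_000, 90_000, 19_000, 1_000))
```

Because the comparison is relative to REF, a strand imbalance shared by both alleles no longer triggers the filter, regardless of how small the p-value is.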

andreas-wilm commented 4 years ago

Yes, that's a possibility. There is a chance that I'll implement this in LoFreq3 rather than here, to avoid duplication of effort, but I'll leave this ticket open here.

rspreafico-vir commented 4 years ago

Works for me! Thank you!

CSB5 / lofreq

Strand-bias: only apply if distribution is opposite? #41