JEFworks-Lab / HoneyBADGER

HMM-integrated Bayesian approach for detecting CNV and LOH events from single-cell RNA-seq data
http://jef.works/HoneyBADGER/
GNU General Public License v3.0

Filtering of identified CNVs #36

Open Josephinedh opened 4 years ago

Josephinedh commented 4 years ago

Hi,

I've been using your tool to identify CNVs in 10x Genomics scRNA-seq data, and I have a question about filtering the identified CNVs. When I run the calcAlleleCnvProb function on a region, the majority of cells are assigned a probability around 0.5, while very few cells are assigned a probability close to either 0 or 1. In the integrated tutorial, I see that you use 0.9 as a cut-off to filter out CNVs with low probabilities, but that is based on full-transcript coverage, so I'm not sure where to set the cut-off in my dataset.

So my question is: is there any way I can improve the CNV probability calling to get a separation between cells with and without a CNV? I have analyzed another dataset with higher coverage per cell, and the probabilities are better separated there, so is this issue just a result of low coverage?

Thanks!

JEFworks commented 4 years ago

Hi,

Thanks for using HoneyBADGER.

A few questions for you that may help me better address your questions:

  1. Are you using the allele, expression, or joint model for identifying CNVs?
  2. For the expression-based model, what are you using as the normal reference for normalization?

You are correct that in both the original manuscript and the online tutorial, I used a rather stringent 90% posterior probability cutoff to identify confident CNVs. We have not systematically evaluated the sensitivity and specificity of our CNV calls at different cutoffs, particularly for 3' or 5' only data, so I cannot provide a quantitative answer as to where to set the optimal cutoff for 10x data. Indeed, an ideal approach to determine the optimal cutoff would be to analyze another, higher-coverage dataset with known CNVs and downsample its coverage to something more comparable to your 10x dataset, to see how the posterior probabilities are distributed when trying to identify the known CNV. You can then assess the accuracy of identifying the known CNV at different cutoffs and pick one that you're comfortable with.
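The cutoff evaluation described above can be sketched in Python as follows. This is only an illustration and not part of HoneyBADGER: `downsample_counts` and `sensitivity_specificity` are hypothetical helpers, the posteriors and truth labels are made-up numbers, and in practice the posteriors would come from rerunning the CNV model on the downsampled data.

```python
import random

def downsample_counts(counts, fraction, rng):
    """Binomially thin read counts: keep each read with probability `fraction`."""
    return [sum(1 for _ in range(c) if rng.random() < fraction) for c in counts]

def sensitivity_specificity(posteriors, has_cnv, cutoff):
    """Call a cell CNV-positive when posterior >= cutoff, then compare to truth."""
    pairs = list(zip(posteriors, has_cnv))
    tp = sum(1 for p, t in pairs if t and p >= cutoff)
    fn = sum(1 for p, t in pairs if t and p < cutoff)
    tn = sum(1 for p, t in pairs if not t and p < cutoff)
    fp = sum(1 for p, t in pairs if not t and p >= cutoff)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec

# Hypothetical per-SNP read counts for one cell, thinned to 25% coverage.
rng = random.Random(0)
thinned = downsample_counts([12, 0, 7, 3], 0.25, rng)

# Hypothetical per-cell posteriors (after downsampling) and known CNV status.
posteriors = [0.95, 0.72, 0.55, 0.48, 0.51, 0.30, 0.62, 0.10]
has_cnv = [True, True, True, False, False, False, True, False]

for cutoff in (0.5, 0.6, 0.75, 0.9):
    sens, spec = sensitivity_specificity(posteriors, has_cnv, cutoff)
    print(f"cutoff={cutoff:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Scanning a few cutoffs this way makes the trade-off explicit: a stricter cutoff raises specificity at the cost of sensitivity, and the acceptable balance depends on the analysis.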

Ultimately, our confidence in the identified CNVs is a function of coverage. If we have full transcript coverage, we are able to be more confident about smaller CNVs. Conversely, if we have 3' or 5' only data, in order to achieve the same level of confidence, we would need larger CNVs. When there is lower coverage, we just have less information to assess and end up generally being less confident about identified CNVs.
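To illustrate why low coverage leaves posteriors near 0.5, here is a deliberately simplified toy model in Python. It is not HoneyBADGER's actual allele model: it assumes phased heterozygous SNPs, a 50/50 allele ratio in neutral cells, a hypothetical `error_rate` for reads from the deleted haplotype, and an assumed prior of 0.5.

```python
def toy_deletion_posterior(k, n, error_rate=0.05, prior=0.5):
    """Posterior probability of a deletion given k of n reads on the retained haplotype.

    Toy assumptions (not HoneyBADGER's model):
      - neutral cell: each read comes from either haplotype with probability 0.5
      - deletion: each read comes from the retained haplotype with
        probability 1 - error_rate
    The binomial coefficient cancels between the two likelihoods.
    """
    lik_deletion = (1 - error_rate) ** k * error_rate ** (n - k)
    lik_neutral = 0.5 ** n
    numerator = prior * lik_deletion
    return numerator / (numerator + (1 - prior) * lik_neutral)

# With no informative reads the posterior equals the prior; it moves away
# from 0.5 only as more reads consistently support one hypothesis.
for n in (0, 1, 2, 5, 10):
    print(n, round(toy_deletion_posterior(n, n), 3))
```

Under these assumptions, a cell with no covered SNPs in the region stays at the prior, and a cell needs several consistent reads before its posterior approaches a stringent cutoff such as 0.9.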

However, if the goal is simply to separate cells with and without CNVs, rather than to identify which specific CNVs each cell harbors, the approach can be less stringent. For example, you can cluster on the posterior probabilities and see whether a group of cells has generally higher CNV probability than the others. Even if the individual posterior probabilities do not reach the 90% cutoff, multiple CNVs could in aggregate suggest a genetically distinct subclone.
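As a rough sketch of the aggregation idea, the Python below averages each cell's posteriors across several CNVs and splits the cells into two groups with a simple one-dimensional 2-means. The clustering method, the helper names, and the numbers are all assumptions chosen for illustration; any clustering of the posterior matrix could be substituted.

```python
def mean_posterior_per_cell(posterior_matrix):
    """Average each cell's posteriors across CNVs (rows = cells, columns = CNVs)."""
    return [sum(row) / len(row) for row in posterior_matrix]

def two_means_1d(values, iterations=50):
    """Split values into a low and a high group with 1-D 2-means.

    Returns (labels, (low_center, high_center)); labels[i] is True when
    values[i] is closer to the high center.
    """
    lo, hi = min(values), max(values)
    labels = [False] * len(values)
    for _ in range(iterations):
        labels = [abs(v - hi) < abs(v - lo) for v in values]
        high = [v for v, is_high in zip(values, labels) if is_high]
        low = [v for v, is_high in zip(values, labels) if not is_high]
        new_lo = sum(low) / len(low) if low else lo
        new_hi = sum(high) / len(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return labels, (lo, hi)

# Hypothetical posteriors for 5 cells across 3 CNVs; none reaches 0.9.
posterior_matrix = [
    [0.70, 0.65, 0.72],
    [0.68, 0.74, 0.66],
    [0.50, 0.48, 0.52],
    [0.51, 0.49, 0.47],
    [0.75, 0.60, 0.71],
]
cell_scores = mean_posterior_per_cell(posterior_matrix)
labels, centers = two_means_1d(cell_scores)
print(labels, centers)
```

In this made-up example no single posterior passes a 0.9 cutoff, yet the per-cell averages still fall into two clearly separated groups.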

Hope that helps. Feel free to let me know if you have any additional questions.

Stay healthy and safe, Jean

Josephinedh commented 4 years ago

Hi Jean,

Thanks for your reply. I'm using the allele model. I also have WES available for this sample, and I see almost the same CNV regions with both methods, so that's nice. For most of the regions, only a very small population of cells has a high probability. After looking into the coverage, I saw that many cells have only a few SNPs covered per chromosome, so I can see how it's impossible to infer anything from that at the cell-specific level.

Thanks for a nice tool.

Best, Josephine