broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

Feature request: Include expected contamination proportion as input #132

Open wmacnair opened 2 years ago

wmacnair commented 2 years ago

Hi

I am trying out CellBender for the first time, and I'm excited to see how it affects the clustering in our single-nucleus RNA-seq data.

One thing I would like to include, but couldn't find a way of doing, is a prior on how much contamination is expected in each sample. Especially with snRNA-seq data, it would be cool to be able to pass e.g. --expected-ambient 0.1 for a sample with 10% mitochondrial reads (some of our samples have substantial contamination...).
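For concreteness, something like this is what I have in mind (a sketch only; --expected-ambient is the hypothetical flag I'm proposing, not an existing option, and the paths are placeholders):

```python
import subprocess

# Hypothetical interface sketch: --expected-ambient does NOT exist in
# CellBender today; it is the flag this feature request proposes.
subprocess.run([
    "cellbender", "remove-background",
    "--input", "raw_feature_bc_matrix.h5",   # placeholder input path
    "--output", "sample_cellbender.h5",      # placeholder output path
    "--expected-ambient", "0.1",             # proposed: ~10% expected contamination
], check=True)
```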

My brief reading of the --fpr option suggests that I couldn't use it for that. Is that correct? Is there some neat way of including this information in the current implementation?

If not, it would be great if you could consider including this in a future version :)

Thanks! Will

wmacnair commented 2 years ago

(Another unrelated thought which I didn't want to include as a full issue:

Colleagues have tried out cellbender and found it super useful, although slow, especially for larger files. Could it make sense to do some approximate fast clustering of the "empty" profiles, then take sums within each cluster? You could use a large number of clusters, so you'd still keep a lot of the variability, but if you reduce the number of "empties" by 100x, I wonder if that could give you a drastic speed-up at little cost.

If this wouldn't work at all, then please excuse my partial understanding of the model!)
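To make the idea concrete, here is a rough sketch of what I mean (my own, not CellBender code; `collapse_empties`, the cluster count, and the embedding choice are all assumptions):

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD

def collapse_empties(empties: csr_matrix, n_clusters: int = 500) -> csr_matrix:
    """Collapse many 'surely empty' droplet profiles into per-cluster sums.

    `empties` is assumed to be a (droplets x genes) sparse count matrix.
    Returns an (n_clusters x genes) matrix, i.e. ~100x fewer rows if you
    start from ~50,000 empty droplets.
    """
    # Cheap low-dimensional embedding so the clustering itself stays fast.
    embedding = TruncatedSVD(n_components=20, random_state=0).fit_transform(empties)
    labels = MiniBatchKMeans(n_clusters=n_clusters, random_state=0).fit_predict(embedding)
    # Sum raw counts within each cluster; sums of counts are still counts.
    return vstack([csr_matrix(empties[labels == k].sum(axis=0))
                   for k in range(n_clusters)])
```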

sjfleming commented 2 years ago

Hi @wmacnair !

As far as "expected contamination" goes... there is no input parameter currently to specify that: you are correct.

The reason is that the cellbender model tries to figure out that information from the dataset itself. This ends up being one of the main things the model learns: how many counts are in empty droplets and how many counts are in cells... and so it ends up with a good estimate of sample-specific contamination. If you still think that including some up-front information about how much contamination is in a sample would be useful, I'm open to continuing the discussion! But as it stands, the model uses things like the average counts in empty droplets (which it estimates up-front) as a prior for how much ambient-RNA contamination is in the sample.

The --fpr option has to do with how we construct the final output "clean" count matrix. Once cellbender has determined a posterior probability distribution over how many noise counts are in each entry of the count matrix, it has to make a hard decision: since we do not output distributions for each count matrix entry, we need to summarize each distribution as a single number, the best guess of the "clean" counts in that entry. The --fpr parameter influences how each probability distribution gets distilled down into a single number. (We use posterior regularization in v0.2.0 to remove "more noise" when the FPR number is larger.) Intuitively, --fpr asks: "how much signal are you okay with removing, if it means you can also remove more noise?" A value of 0.01 indicates that the algorithm should remove as much noise as it can while keeping the removal of real signal to around 1% (of the real signal). An FPR value of 1 would remove the entire dataset. (0.01 is the current default.)

If you have a dataset that you think is particularly noisy, one way to get a "cleaner" output is to crank up the FPR value. But cellbender already tries to remove approximately as much noise as it thinks is present in the dataset (based on the counts in empty droplets), plus the given FPR on top of that.

So I do think that using the --fpr parameter can get you at least part of what you want: if you have a noisy dataset, you can use a larger FPR value to get "more" noise removal. But do keep in mind that the algorithm already tries to remove as much noise as it thinks is appropriate for each dataset. That means if I have two datasets, one clean and one noisy, and I run both with FPR 0.01, then I would hope to see that the amount of noise remaining in the two datasets is approximately the same. (We plan to make this more precise in v0.3.0.)
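As a usage sketch (file paths are placeholders; --input, --output, --fpr, and --cuda are real remove-background options), you could run the same sample at a couple of FPR settings and compare the outputs:

```python
import subprocess

# Run the same raw matrix at two FPR settings to compare noise removal.
for fpr in ["0.01", "0.05"]:
    subprocess.run([
        "cellbender", "remove-background",
        "--input", "raw_feature_bc_matrix.h5",   # placeholder input path
        "--output", f"cellbender_fpr{fpr}.h5",   # placeholder output path
        "--fpr", fpr,
        "--cuda",                                # if a GPU is available
    ], check=True)
```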

sjfleming commented 2 years ago

As for speed... yes, speed is always an issue I am thinking about. Version 0.3.0 will be out within the next month, and it does have some speedups. A significant advantage (if you are running in Terra or using Cromwell) is that it will save checkpoint files, so it can be run on preemptible GPUs at a reduced cost. There are a few other speedups as well, but the runtime is still pretty long: it can be 30 minutes to an hour (on a Tesla T4 GPU) depending on dataset size.

One thing to consider if you have not played around with it: reducing --total-droplets-included will speed things up, since runtime scales roughly linearly with it. If you can reduce that number, cellbender runs faster. However, you want to be sure to include all the droplets that have any chance of being "non-empty": that is to say, anything not included in --total-droplets-included should be "surely empty".
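One way to pick that number is from the UMI-count rank curve of the raw matrix. Here is a rough heuristic sketch (my own, not CellBender code; `pick_total_droplets`, the tail window, and the `ambient_multiple` cutoff are all assumptions):

```python
import numpy as np

def pick_total_droplets(counts_per_barcode: np.ndarray,
                        ambient_multiple: float = 2.0) -> int:
    """Pick a --total-droplets-included value from per-barcode UMI totals.

    Everything ranked past the returned cutoff should be 'surely empty':
    its total counts sit near the ambient plateau of the rank curve.
    """
    sorted_counts = np.sort(counts_per_barcode)[::-1]
    # Assume ranks 50k-100k are deep in the empty-droplet plateau.
    ambient = np.median(sorted_counts[50_000:100_000])
    # First rank where counts drop to within ambient_multiple of the plateau.
    return int(np.searchsorted(-sorted_counts, -ambient_multiple * ambient))
```

Then pass something comfortably above that cutoff, erring on the side of including too many droplets rather than too few.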