broadinstitute/CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

Impact of using different cellbender parameters across samples on downstream analysis #271

Open dquintanatorres opened 1 year ago

dquintanatorres commented 1 year ago

Hi @sjfleming and team!

First of all, thanks for creating such a useful tool. While analyzing data we generated alongside some external public datasets, a question came to mind: are there any cellbender parameters that, if used differently across samples, could bias downstream analysis? From what I understand from the paper, cellbender does an excellent job of correcting spurious differential gene expression caused by background noise. However, I'm curious whether using different parameters between samples could introduce biases or differences. My intuition is that adjusting parameters such as the number of epochs and the learning rate to achieve near-optimal training in each sample shouldn't matter, since each sample is trained independently. On the other hand, I suspect that using a different false positive rate (FPR) to remove varying amounts of noise could introduce differences across samples through the background-removal procedure itself. Is my understanding correct?

Thanks again!

sjfleming commented 1 year ago

Hi @diego-qt, this is a very good question, and I think it deserves a long answer and maybe even a few experiments on my part.

I'll post a short partial answer for now, and get back to you with a longer answer. :)

TL;DR If possible, it's always ideal to use the same version of cellbender and the same FPR on all samples. Other parameters, like the learning rate and the number of epochs, can be varied per sample.
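For concreteness, here is a hypothetical batch sketch along those lines (sample names, file paths, and settings are made up; the flags are real `cellbender remove-background` options):

```python
# Hypothetical batch sketch: pin one shared --fpr (and one cellbender
# version) across all samples, while per-sample training knobs vary freely.
import subprocess

SHARED_FPR = "0.01"  # keep this value identical across samples

SAMPLES = {
    # sample name -> (raw 10x h5 path, epochs, learning rate); all made up
    "sampleA": ("sampleA/raw_feature_bc_matrix.h5", 150, 1e-4),
    "sampleB": ("sampleB/raw_feature_bc_matrix.h5", 300, 5e-5),
}

for name, (raw_h5, epochs, lr) in SAMPLES.items():
    subprocess.run(
        [
            "cellbender", "remove-background",
            "--input", raw_h5,
            "--output", f"{name}_cellbender.h5",
            "--fpr", SHARED_FPR,          # shared noise-removal target
            "--epochs", str(epochs),      # per-sample: affects training only
            "--learning-rate", str(lr),   # per-sample: affects training only
            "--cuda",
        ],
        check=True,
    )
```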

Longer answer:

I think your intuition is correct. One of the guiding principles in writing cellbender was that it should aim to "do no harm" to the data, touching it as little as possible to achieve denoising. In practice, though, that aim will not be achieved perfectly. The FPR parameter is a knob that people can use to remove more noise, and it involves the trade-off inherent in any noise-removal procedure: you remove more noise at the expense of some signal. A good noise-removal algorithm will maximize the removal of noise while minimizing the removal of signal, but still, some signal will be removed as you increase the FPR.
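As a toy illustration of that trade-off (to be clear, this is not CellBender's model, just an idealized denoiser whose FPR knob removes a growing sliver of real signal along with the background):

```python
# Toy numpy illustration of the noise-vs-signal trade-off; NOT CellBender's
# actual model. An idealized denoiser removes the ambient counts plus, as
# its "FPR" knob is turned up, an extra fraction of the real signal.
import numpy as np

rng = np.random.default_rng(0)
signal = rng.poisson(lam=50, size=10_000)   # true counts per cell
ambient = rng.poisson(lam=5, size=10_000)   # background counts per cell

for fpr in [0.0, 0.01, 0.05, 0.10]:
    removed = ambient + (fpr * signal).astype(int)
    signal_lost = int(removed.sum() - ambient.sum())
    print(f"FPR {fpr:4.2f}: removed {int(removed.sum()):>8d} counts total, "
          f"{signal_lost:>6d} of them real signal")
```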

A lot of thought went into this question as we went from v0.2.0 to v0.3.0. In v0.2.0, higher FPR values tended to over-remove counts from the genes responsible for a lot of the noise: if you really cranked up the FPR, certain genes ended up over-removed compared to the truth. In v0.3.0, we tried to address this head-on by explicitly requiring that the per-gene total count removal adhere to certain expectations, even as the FPR is cranked up.

Let me try to be a bit clearer by referring to some supplementary figures from the paper: https://www.nature.com/articles/s41592-023-01943-7/figures/16. This figure is a bit technical, but it highlights the issue.

This is what was going on in v0.2.0:

[Supplementary Fig. 16 panel: per-gene count removal in v0.2.0 as the FPR is increased]

and this is v0.3.0:

[Supplementary Fig. 16 panel: per-gene count removal in v0.3.0 as the FPR is increased]

What we're showing here is a series of plots where each dot is a gene ("beta" and "nFPR" are both ways of "cranking up the FPR"). The interpretation is that, if you increase the FPR in v0.2.0, you get disproportionate removal of the highly-expressed (noisy) genes. In v0.3.0, as you increase the FPR, you get proportionate increases in the removal of every gene, which is why the plot looks like a horizontal line.
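If you want to eyeball this behavior on your own data, a rough per-gene check in the same spirit as the figure could look like the sketch below. File paths are placeholders, and I'm assuming the raw and denoised count matrices have already been loaded as AnnData objects with matching genes:

```python
# Rough per-gene removal check: each dot is a gene, x = total raw expression,
# y = fraction of its counts removed by denoising. A flat line means
# proportionate removal across genes.
import anndata as ad
import matplotlib.pyplot as plt
import numpy as np

raw = ad.read_h5ad("raw_counts.h5ad")            # hypothetical path
denoised = ad.read_h5ad("denoised_counts.h5ad")  # hypothetical path

raw_totals = np.asarray(raw.X.sum(axis=0)).ravel()
den_totals = np.asarray(denoised.X.sum(axis=0)).ravel()

keep = raw_totals > 0
frac_removed = 1.0 - den_totals[keep] / raw_totals[keep]

plt.scatter(np.log10(raw_totals[keep]), frac_removed, s=3, alpha=0.3)
plt.xlabel("log10 total raw counts per gene")
plt.ylabel("fraction of counts removed")
plt.show()
```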

So that's one piece of information, but it doesn't answer your question.

The closest we came to answering your specific question is probably this figure: https://www.nature.com/articles/s41592-023-01943-7/figures/14. What we're showing is that, with simulated data where the "truth" is that no genes are differentially expressed, you get some false discoveries due to background noise if you analyze the raw data:

[Supplementary Fig. 14 panel: false-positive differentially expressed genes when analyzing the raw data]

and that, if you instead analyze cellbender-processed data (all samples run with FPR 0.01), these false discoveries disappear:

[Supplementary Fig. 14 panel: no false discoveries after cellbender processing with FPR 0.01 on all samples]

What we did not do is construct a scenario where we intentionally used different cellbender settings on different samples, to see whether new false discoveries could be induced...
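For anyone who wants to try that themselves, a rough sketch of the comparison with scanpy could look like the following, run on data where the ground truth is that no genes are differentially expressed (group labels and file paths are placeholders):

```python
# Sketch: run the same DE test on raw and cellbender-processed matrices and
# count discoveries; on "no true DE" data, every hit is a false discovery.
import scanpy as sc

def count_de_hits(adata, groupby="condition", alpha=0.05):
    """Number of genes with adjusted p-value < alpha across all groups."""
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.tl.rank_genes_groups(adata, groupby=groupby, method="wilcoxon")
    df = sc.get.rank_genes_groups_df(adata, group=None)  # all groups
    return int((df["pvals_adj"] < alpha).sum())

raw = sc.read_h5ad("raw_two_conditions.h5ad")            # hypothetical path
cb = sc.read_h5ad("cellbender_two_conditions.h5ad")      # hypothetical path

print("hits on raw data:        ", count_de_hits(raw))
print("hits on cellbender data: ", count_de_hits(cb))
```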