constantAmateur / SoupX

R package to quantify and remove cell free mRNAs from droplet based scRNA-seq data

Cell specific contamination fraction #100

Closed mvfki closed 2 years ago

mvfki commented 2 years ago

Thanks for this great tool! I can now follow the tutorial and reproduce the result. If I understand correctly, SoupX estimates a single global contamination fraction for each channel and then redistributes the contamination counts across the cells.

When I read your paper, I found a paragraph in the Supplementary Material about estimating the cell-specific contamination fraction:

For cases where there is a need to estimate cell-specific contamination, we share information between cells using a hierarchical Bayes model. Under the model: (equation S5-S9)

I'm wondering whether there is a code implementation that returns a vector of per-cell contamination fractions instead of the same rho for all cells. Alternatively, it would be great if you could suggest a workaround for obtaining this information manually.

constantAmateur commented 2 years ago

There is a version of SoupX that does this, available by installing the STAN branch of the code from the GitHub repository. However, it comes with many limitations.

First, it only works in "manual" mode, where it is up to you to specify which genes to use to estimate the contamination. Second, there is usually very little information in an individual cell to estimate a cell-specific rho, outside of extreme cases like species-mixing experiments. The Bayesian model accounts for this by sharing information between cells to smooth out the cell-specific estimates, but there's only so much that can be done.

Unless you have a strong need for the Bayesian model estimates, I would recommend calculating the "effective rho" for each cell and using this as a proxy. What I mean is that although SoupX takes a global contamination fraction as input to the correction process, the correction is done at the cluster level and then propagated back to the cell level. For various reasons I won't go into, this propagation ends up removing an uneven fraction of counts from each cell.

So if you take the total counts for each cell in the corrected matrix and divide by the total counts for that cell in the original matrix, you will get what is effectively an estimate of the cell-specific contamination fraction.

maxim-h commented 1 year ago

@constantAmateur thank you for the tip. I was also wondering about the same thing.

Just to be sure though, to get effective_rho shouldn't it be 1 - total_corrected_counts/total_original_counts?
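
For anyone following along, the arithmetic in question can be sketched on a toy example. This is only a minimal illustration in plain Python with made-up count matrices (genes as rows, cells as columns); `column_totals` and `effective_rho` are hypothetical helper names, not part of SoupX. It uses the 1 - corrected/original form, because corrected/original on its own is the fraction of counts retained in each cell rather than the fraction removed.

```python
import math

# Made-up toy counts, NOT SoupX output. Rows are genes, columns are cells.
original = [
    [10, 4, 0],
    [5, 6, 8],
    [5, 10, 12],
]
corrected = [
    [9, 4, 0],
    [4, 6, 6],
    [5, 9, 10],
]


def column_totals(matrix):
    """Sum the counts of each cell (each column) over all genes."""
    return [sum(column) for column in zip(*matrix)]


def effective_rho(original, corrected):
    """Fraction of each cell's counts removed by the correction.

    Computed as (original_total - corrected_total) / original_total,
    which equals 1 - corrected_total / original_total.
    Cells with no counts in the original matrix get NaN.
    """
    rho = []
    for orig_total, corr_total in zip(column_totals(original),
                                      column_totals(corrected)):
        if orig_total == 0:
            rho.append(math.nan)
        else:
            rho.append((orig_total - corr_total) / orig_total)
    return rho


for cell_index, rho in enumerate(effective_rho(original, corrected)):
    print(f"cell {cell_index}: effective rho = {rho:.3f}")
```

In a real analysis the two inputs would be the raw count matrix and the corrected matrix for the same cells, and the per-cell values would generally differ from the global rho because of the uneven propagation described above.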