constantAmateur / SoupX

R package to quantify and remove cell free mRNAs from droplet based scRNA-seq data

Calculate contamination proportion for each gene #108

Closed ghost closed 2 years ago

ghost commented 2 years ago

I want to calculate the proportion of the count for each gene that could be attributed to ambient contamination. So far, I have used the following approach:

# observed count
obs <- rowSums(x$toc)

# expected count
exp <- rowSums(outer(x$soupProfile$est, x$metaData$nUMIs))

# proportion
p <- exp / obs
p[p > 1] <- 1
p[obs == 0] <- NaN

This approach seems correct to me, but I just wanted to check that I'm calculating the expected counts correctly. I seem to have a lot of genes with very high proportions (> 0.75) and thought I may have misinterpreted what the est column values are?

constantAmateur commented 2 years ago

I'm not sure why you want these values, but the est column is directly the expected proportion of reads for each gene in the soup, normalised to sum to 1.
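Because est sums to 1, it describes only the composition of the soup, not how much soup each cell contains. Multiplying est by nUMIs alone therefore treats every UMI as ambient, which would explain the very high proportions. Below is a minimal sketch on made-up toy values, assuming the per-cell contamination fraction is available as x$metaData$rho (it is stored there once the contamination fraction has been set, e.g. by autoEstCont or setContaminationFraction). The variable names are illustrative stand-ins, not part of the package.

```r
# Toy stand-ins for x$soupProfile$est, x$metaData$nUMIs and x$metaData$rho.
# All values are invented for illustration.
est   <- c(geneA = 0.5, geneB = 0.3, geneC = 0.2)  # soup profile, sums to 1
nUMIs <- c(cell1 = 1000, cell2 = 2000)             # total UMIs per cell
rho   <- c(cell1 = 0.10, cell2 = 0.05)             # assumed contamination fractions

# Expected number of soup UMIs in each cell
soup_per_cell <- nUMIs * rho

# Expected soup UMIs per gene, summed over cells.
# Equivalent to rowSums(outer(est, soup_per_cell)) without building the matrix.
expected_soup <- est * sum(soup_per_cell)

# The same calculation without rho, for comparison
expected_no_rho <- est * sum(nUMIs)

print(expected_soup)
print(expected_no_rho)
```

With rho values of 5 to 10%, the expected soup counts here are more than ten times smaller than the version without rho.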

If you want to know what fraction of reads were actually removed for each gene, just compare the output object with the original object. That is, 1 - rowSums(corrected_toc)/rowSums(x$toc); the ratio rowSums(corrected_toc)/rowSums(x$toc) on its own is the fraction retained.
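A minimal sketch of that comparison on toy data, assuming corrected_toc is the matrix returned by adjustCounts(x). The small dense matrices below are invented stand-ins. The real x$toc is a sparse matrix, for which rowSums should behave the same way once the Matrix package is loaded.

```r
# Toy stand-ins: toc plays the role of x$toc, corrected_toc the role of the
# corrected matrix. matrix() fills column by column, so each line is one cell.
genes <- c("geneA", "geneB", "geneC")
cells <- c("cell1", "cell2")
toc <- matrix(c(10, 0, 4,    # cell1
                 6, 0, 8),   # cell2
              nrow = 3, dimnames = list(genes, cells))
corrected_toc <- matrix(c(8, 0, 4,
                          4, 0, 6),
                        nrow = 3, dimnames = list(genes, cells))

obs      <- rowSums(toc)                  # observed counts per gene
retained <- rowSums(corrected_toc) / obs  # NaN for genes with no counts (0/0)
removed  <- 1 - retained                  # fraction attributed to the soup

print(removed)
```

Genes with zero observed counts come out as NaN automatically, so no separate step is needed to mask them.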