campbio / celda

Bayesian Hierarchical Modeling for Clustering Single Cell Genomic Data
http://bioconductor.org/packages/celda
MIT License

High level of estimated contamination for same sample but different set of filtered cells #384

Closed dlee598 closed 1 year ago

dlee598 commented 1 year ago

Hi Team at Celda,

Recently, my Cell Ranger output looked unusual: Cell Ranger estimated 4,000 cells, whereas I expected about 10,000, assuming 20,000 cells were loaded with a 50% capture rate. To see what happened to my sample, I forced Cell Ranger to return 10,000 cells so that I could assess the extra "cells" 4,000 to 10,000.

I ran decontX on both results (default Cell Ranger and forced 10,000), with both the filtered and raw matrices as input. Surprisingly, the estimated contamination was low for the default results (which is great), but the forced-10,000 set had much higher levels of estimated contamination.

[Image: decontX results for the default cells called by Cell Ranger]

[Image: decontX results for the forced 10,000 cells]

I have also compared the contamination estimates between the default and forced-10,000 runs to see how the estimates differ for the same cells.

[Image: comparison of contamination estimates for the same cells in the two runs]

It appears that cells that were estimated to have low contamination in the default run now have high levels in the forced-10,000 run.

Is this behaviour expected, or do I need to set parameters in the decontX function?

joshua-d-campbell commented 1 year ago

Hi @dlee598, thanks for using our tool! Just to clarify, is the scatter plot you are showing for the 4,000 cells that were included in both analyses? Can you show the cluster labels that decontX generated in each analysis?

I'm wondering if some cells that were estimated to have lower contamination in the default setting (first analysis) ended up in the same cluster as cells/droplets that were originally excluded (second analysis). You can specify your own cluster labels from another tool, such as Seurat, using the z parameter, which may help with this.
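
A minimal sketch of that check, assuming the cluster labels from each decontX run have been exported as barcode-to-label mappings. The barcodes, labels, and function names below are made up for illustration, and the sketch is written in Python rather than R; it is not part of celda.

```python
from collections import Counter, defaultdict


def extra_droplet_fraction_by_cluster(forced_labels, default_barcodes):
    """For each cluster in the forced run, return the fraction of its
    members that were NOT among the default cell calls."""
    default_set = set(default_barcodes)
    totals = Counter()
    extras = Counter()
    for barcode, cluster in forced_labels.items():
        totals[cluster] += 1
        if barcode not in default_set:
            extras[cluster] += 1
    return {cluster: extras[cluster] / totals[cluster] for cluster in totals}


def cross_tabulate(default_labels, forced_labels):
    """Count shared barcodes by (default-run cluster, forced-run cluster)."""
    shared = sorted(set(default_labels) & set(forced_labels))
    table = defaultdict(Counter)
    for barcode in shared:
        table[default_labels[barcode]][forced_labels[barcode]] += 1
    return {row: dict(cols) for row, cols in table.items()}


# Hypothetical toy data: four cells called by default, six in the forced run.
default_labels = {"AAAC": 1, "AAAG": 1, "AAAT": 2, "AACA": 2}
forced_labels = {"AAAC": 1, "AAAG": 1, "AAAT": 3, "AACA": 3, "AACC": 3, "AACG": 3}

fractions = extra_droplet_fraction_by_cluster(forced_labels, default_labels)
table = cross_tabulate(default_labels, forced_labels)
print(fractions)
print(table)
```

A forced-run cluster with a high fraction mixes originally called cells with the extra droplets, which is the situation where supplying your own labels through z might help.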

In general, it appears that both Cell Ranger and decontX agree that the extra cells you get by forcing 10K cells look like ambient background, so you may not want to include those in your downstream analysis.

Also, I am going to convert this Issue to a Discussion as other people may have similar questions.