campbio / celda

Bayesian Hierarchical Modeling for Clustering Single Cell Genomic Data
http://bioconductor.org/packages/celda
MIT License

Threshold to filter contaminated cells #353

Closed: supermegamio closed this 2 years ago

supermegamio commented 2 years ago

Hello!

Congratulations on the paper and the R package.

I am using the package on my data (single-nucleus RNA-seq) to decontaminate the counts from the isolated nuclei. I ran the decontX functions with default parameters and plotted the UMAP to visualise the contamination across the datasets. The contamination is fairly high in both samples.

[Image: Boxplot]

[Image: UMAP]

I was wondering if there is a threshold value above which a cell should be considered "too" contaminated and filtered out. Alternatively, would it make sense to use a filter such as MAD (median absolute deviation) to remove the cells that fall too far from the rest of the distribution?

Thank you very much in advance

joshua-d-campbell commented 2 years ago

Hello @supermegamio, thanks so much! How to choose a threshold is a good question, and it can be tricky. We have used a cutoff of 0.5 for some of our snuc-seq datasets, but it does depend on the dataset. You may want to use something like 0.60 so that you do not lose too many cells in sample "Nuc_2". Often we run through the analysis once with a clustering tool such as Seurat or Celda without removing any cells (but using the decontaminated counts). We then decide whether the highly contaminated cells should be removed, for example if they make distinct cell populations start to "blend together" on the UMAP. Either way, I would definitely recommend using the decontaminated counts for your downstream analysis.
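To make the trade-off between a fixed cutoff and a MAD-based rule concrete, here is a minimal sketch in plain Python. It is not celda code. It assumes the per-cell contamination estimates have already been exported from R (in recent celda versions they should be in the `decontX_contamination` column of `colData`, but please check your version). The values, the helper names `mad_upper_bound` and `keep_mask`, and the choice of 3 MADs are all made up for illustration.

```python
from statistics import median


def mad_upper_bound(values, n_mads=3.0):
    """Return median + n_mads * scaled MAD for a sequence of numbers."""
    med = median(values)
    # 1.4826 scales the MAD so it is comparable to a standard deviation
    # when the data are roughly normal.
    mad = 1.4826 * median(abs(v - med) for v in values)
    return med + n_mads * mad


def keep_mask(values, cutoff):
    """True for cells whose contamination is at or below the cutoff."""
    return [v <= cutoff for v in values]


# Hypothetical contamination estimates per sample (not real data).
contamination = {
    "Nuc_1": [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.90],
    "Nuc_2": [0.35, 0.45, 0.50, 0.55, 0.58, 0.62, 0.95],
}

summary = {}
for sample, values in contamination.items():
    mad_cutoff = mad_upper_bound(values)
    summary[sample] = {
        "n_cells": len(values),
        "kept_at_0.5": sum(keep_mask(values, 0.5)),
        "kept_at_0.6": sum(keep_mask(values, 0.6)),
        "mad_cutoff": mad_cutoff,
        "kept_by_mad": sum(keep_mask(values, mad_cutoff)),
    }

for sample, stats in summary.items():
    print(sample, stats)
```

Note that a MAD rule computed within each sample only flags cells that are outliers relative to that sample. If a whole sample is highly contaminated, its MAD cutoff will be high as well, so comparing against a fixed cutoff is still worthwhile.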

A couple of other questions and thoughts. Did you give the sample label to the "batch" parameter? DecontX is meant to be run on each sample separately. If both samples were in the same counts matrix but the "batch" parameter was not supplied, then you may have suboptimal results. Also, sometimes we see some improvement if you supply the "raw" matrix to the background parameter (currently you would have to run decontX for each sample separately to do this). This isn't always necessary though.
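For running each sample separately, the cells first have to be split by their sample label. The sketch below shows only that bookkeeping step in plain Python; it is not part of celda, and the barcodes, labels, and the helper name `split_by_batch` are hypothetical.

```python
from collections import defaultdict


def split_by_batch(cell_ids, batch_labels):
    """Group cell identifiers by their batch (sample) label."""
    if len(cell_ids) != len(batch_labels):
        raise ValueError("cell_ids and batch_labels must have the same length")
    groups = defaultdict(list)
    for cell, batch in zip(cell_ids, batch_labels):
        groups[batch].append(cell)
    return dict(groups)


# Hypothetical barcodes and sample labels (one label per cell).
cells = ["AAAC-1", "AAAG-1", "CCTA-2", "GGTT-2", "TTAC-1"]
batches = ["Nuc_1", "Nuc_1", "Nuc_2", "Nuc_2", "Nuc_1"]

per_sample = split_by_batch(cells, batches)
for sample, members in per_sample.items():
    # Each group would be decontaminated on its own, optionally with
    # that sample's raw matrix as the background.
    print(sample, len(members), members)
```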

Hope that helps!

supermegamio commented 2 years ago

Thank you for your quick and detailed response :)

Yes, I provided DecontX with a vector giving the batch of each cell. I will also try the raw matrix provided by CellRanger and compare the two sets of results.