AntonioDeFalco / SCEVAN

R package that automatically classifies the cells in scRNA data by segregating non-malignant cells of the tumor microenvironment from the malignant cells. It also infers the copy number profile of malignant cells, identifies subclonal structures, and analyses the specific and shared alterations of each subpopulation.
https://www.nature.com/articles/s41467-023-36790-9
GNU General Public License v3.0

Filtered cells in pre-filtered dataset #110

Closed EmanueleRosatti closed 5 months ago

EmanueleRosatti commented 5 months ago

Hi, thanks for the great tool!

I have a question about the filtering step performed by the algorithm. The documentation and the original publication mention that cells with fewer than 200 expressed genes are filtered out during the pipeline. I recently ran the algorithm on a pre-filtered dataset, where among other filters I had already removed cells with fewer than 200 genes, like this:

seurat_sub <- subset(seurat_obj, subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & nCount_RNA > 500 & log10GenesPerUMI > 0.8 & percent_mt < 20)

However, I still get a sizeable number of filtered cells:

table(seurat_sub_SCEVAN$class)

filtered   normal    tumor
   17326    41333    53897

I was wondering where these filtered cells come from. Is there any additional filter applied to cells in the algorithm? I am uncertain how to proceed, because the filtered cells make up a significant percentage of my dataset.

AntonioDeFalco commented 5 months ago

Hi @EmanueleRosatti, yes. As you can see in the pipeline output and in the public code, in addition to the cells filtered by gene number, cells are also filtered out if they do not have at least 5 genes expressed on each chromosome (a necessary QC condition for copy number analysis).
Regarding the percentage of filtered cells: I see that your sample contains about 112 thousand cells. Why is the number so large?
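
To see which cells would fail a per-chromosome check of this kind before running the pipeline, something like the following base R sketch may help. It is not SCEVAN's actual code: only the threshold of 5 genes per chromosome comes from the reply above, while the toy matrix, the `gene_chr` annotation vector, and the helper name `genes_per_chr` are assumptions made up for illustration. With a real dataset the chromosome of each gene would have to come from a gene annotation.

```r
# Toy count matrix (hypothetical data): 18 genes in rows, 3 cells in columns.
# gene_chr gives the chromosome of each gene, in row order.
gene_chr <- rep(c("chr1", "chr2", "chr3"), each = 6)
counts <- matrix(1, nrow = 18, ncol = 3,
                 dimnames = list(paste0("gene", 1:18),
                                 c("cell_A", "cell_B", "cell_C")))
counts[gene_chr == "chr3", "cell_B"] <- c(1, 1, 0, 0, 0, 0)  # only 2 genes on chr3
counts[gene_chr == "chr2", "cell_C"] <- 0                    # nothing on chr2

# Hypothetical helper: number of expressed genes (count > 0)
# per chromosome (rows) and per cell (columns).
genes_per_chr <- function(counts, gene_chr) {
  expressed <- (counts > 0) * 1          # 0/1 matrix, dimnames preserved
  rowsum(expressed, group = gene_chr)    # sum the rows within each chromosome
}

min_genes <- 5  # threshold quoted in the reply above
per_chr <- genes_per_chr(counts, gene_chr)

# A cell passes only if every chromosome reaches the threshold.
passes <- apply(per_chr >= min_genes, 2, all)

print(per_chr)
print(passes)
```

In this toy example `cell_B` expresses 14 genes in total but still fails because of a single chromosome, which is how a cell can pass a filter on the total number of genes and still be dropped by a per-chromosome one.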

Thanks for your appreciation.

Regards

EmanueleRosatti commented 5 months ago

Thanks for the response. I missed the part about the 5 genes expressed per chromosome. It's strange that cells with such "uneven" gene expression pass the other QC filters, but I guess there is not much to do about that.

As for the number of cells in the dataset, this is a multi-sample object with 22 samples. As suggested, I ran the SCEVAN pipeline on each sample separately, and then collapsed the results into a single object to be able to visualize the classification and the CNAs on the integrated map.
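
A minimal base R sketch of collapsing per-sample results like this, and of checking whether the filtered cells concentrate in particular samples, could look as follows. The data frames, barcodes, and the `sample_barcode` naming scheme are hypothetical; the cell names would need to match whatever convention the merged object actually uses.

```r
# Hypothetical per-sample results: one data frame per sample,
# rows named by cell barcode, with a 'class' column.
res_s1 <- data.frame(class = c("tumor", "normal", "filtered"),
                     row.names = c("AAAC", "AAAG", "AAAT"))
res_s2 <- data.frame(class = c("normal", "tumor"),
                     row.names = c("AAAC", "CCCA"))
results <- list(sample1 = res_s1, sample2 = res_s2)

# Stack the samples, prefixing barcodes with the sample name so that
# identical barcodes from different samples stay distinct.
combined <- do.call(rbind, lapply(names(results), function(s) {
  df <- results[[s]]
  data.frame(cell = paste(s, rownames(df), sep = "_"),
             sample = s,
             class = df$class,
             stringsAsFactors = FALSE)
}))

# Cells per class in each sample, and the filtered fraction per sample.
tab <- table(combined$sample, combined$class)
frac_filtered <- prop.table(tab, margin = 1)[, "filtered"]

print(tab)
print(frac_filtered)
```

Looking at the filtered fraction per sample rather than only the pooled total can show whether a few samples account for most of the filtered cells.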