deweylab / CellO

CellO: Gene expression-based hierarchical cell type classification using the Cell Ontology
MIT License

why every cluster is predicted as "oxygen accumulating cell (CL:0000329)"? #9

Closed hurleyLi closed 2 years ago

hurleyLi commented 3 years ago

Hi, I have a dataset from GSE123814 that I'm trying to re-analyze using CellO. I normalized the data using the typical approach in your tutorial, and cello.scanpy_cello() finished without error. However, all ~30 clusters were predicted as oxygen accumulating cell (CL:0000329). I also tried several other datasets, and it seems that CellO only works for PBMC data, but not for other tissue types. Could you please comment on why this might happen and how to adjust the training for other tissue types? Thanks, Hurley

mbernste commented 3 years ago

Hi Hurley,

I'm sorry you're facing these issues. CellO should work for other tissue types, not just PBMC (see the original publication, which includes analyses on other tissue types: https://doi.org/10.1016/j.isci.2020.101913 ).

I notice that the data in GEO for the study that you mention consists of raw counts. Given that you are normalizing the data using the tutorial, I assume you are NOT normalizing the data into units of log(TPM+1)? For bulk RNA-seq assays, in which reads are generated from nearly the full length of the transcript, CellO requires expression in units of log(TPM+1). The normalization procedure in the tutorial works only for 3' assays such as 10x single cell data in which log(CPM+1) is equivalent to log(TPM+1).
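To make the distinction between the two units concrete, here is a minimal sketch in plain Python. It is not CellO's or the tutorial's code; the function names, counts, and gene lengths are hypothetical and chosen only for illustration. It shows that log(CPM+1) and log(TPM+1) coincide when transcript length does not affect the counts (modeled here as equal lengths), and diverge for full-length assays where longer genes collect more reads.

```python
import math

def log_cpm(counts):
    """log(CPM+1) for one sample: scale counts to one million, then log1p."""
    total = sum(counts)
    return [math.log1p(c / total * 1e6) for c in counts]

def log_tpm(counts, lengths_bp):
    """log(TPM+1) for one sample: divide by gene length first, then scale to one million."""
    # Reads per kilobase of transcript for each gene.
    rates = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    return [math.log1p(r / total * 1e6) for r in rates]

# Hypothetical raw counts for three genes in a single sample.
counts = [100, 100, 800]
# Hypothetical gene lengths in base pairs (an assumption, not taken from the dataset).
lengths = [1000, 4000, 2000]

cpm = log_cpm(counts)
tpm = log_tpm(counts, lengths)

# With equal lengths the length term cancels, so TPM reduces to CPM.
equal_len = log_tpm(counts, [1000, 1000, 1000])

print("log(CPM+1):", [round(v, 3) for v in cpm])
print("log(TPM+1):", [round(v, 3) for v in tpm])
print("log(TPM+1), equal lengths:", [round(v, 3) for v in equal_len])
```

In this toy sample the first two genes have identical counts and therefore identical log(CPM+1), but the longer gene gets a lower log(TPM+1), which is the correction that full-length bulk data needs and 3' data does not.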

Best, Matt

mbernste commented 3 years ago

One more quick note on that dataset: it consists of fine-grained T cell subtypes. We found CellO is less accurate on many T cell subtypes (see Figure 6 from the paper), though I would expect CellO to label these at least as T cells (which it usually annotates very accurately). We are currently looking for more data to include in CellO's training set to increase the accuracy on these subtypes.

mbernste commented 3 years ago

Hi Hurley,

One last note: I ran CellO on this dataset after normalizing via the tutorial (which, as I mentioned, is technically not correct for bulk RNA-seq samples), and CellO did classify all of the samples correctly as T cells and correctly classified the naive T cell subtypes:

image

This leads me to believe that the normalization is not the main issue. If you would like, you can send me the code you are using and I can see if I can spot any problems!

Best, Matt