IanevskiAleksandr / sc-type

GNU General Public License v3.0

Could the detection of oligodendrocytes be possibly improved? #33

Open LASeeker opened 1 year ago

LASeeker commented 1 year ago

Hi Aleksandr, I just tested your sc-type method on my dataset (https://pubmed.ncbi.nlm.nih.gov/37217978/) and it works really nicely for most cell types. So thank you for that! Below I am showing my first rough annotation (the "unknown" cluster turned out to be immune cells), followed by the annotation using sc-type.

You will see that sc-type performed very well. However, it did not recognise oligodendrocytes (which happen to be the main focus of our lab). Would it be possible to add to the gene database to improve the detection of oligos? We would be happy to suggest additional marker genes; PLP1 may be a good one, for example. Also, the detection of cerebellar granule cells (RELN+) was not perfect.

Cool tool, thank you!

[image: first rough annotation] [image: sc-type annotation]

IanevskiAleksandr commented 1 year ago

We are planning to release sctype 2.0 early next year (though it may appear on GitHub much earlier), with the addition of many new cell types and analysis options. Thanks for the suggestions. Please send more markers with corresponding references if you have them.

pedriniedoardo commented 1 year ago

Interestingly, we had a similar situation in our lab. We noticed that the issue originated from the fact that not all the marker genes made the HVG (highly variable gene) cut. To keep the object slimmer, we do not scale all the features in the object. Since the tool relies on extracting the scale.data slot, the scoring is affected if the genes are not there. In particular, when exploring the scale.data slot, we noticed that not many genes from ScTypeDB_full.xlsx were present.

For the positive markers:

lapply(gs_list$gs_positive,function(x){
  sum(rownames(scobj[["RNA"]]@scale.data) %in% x)
})
$Astrocytes
[1] 8

$`Cholinergic neurons`
[1] 0

$`Dopaminergic neurons`
[1] 1

$`Endothelial cells`
[1] 10

$`GABAergic neurons`
[1] 3

$`Glutamatergic neurons`
[1] 2

$`Immature neurons`
[1] 0

$`Immune system cells`
[1] 0

$`Mature neurons`
[1] 2

$`Microglial cells`
[1] 7

$`Myelinating Schwann cells`
[1] 0

$`Neural Progenitor cells`
[1] 0

$`Neural stem cells`
[1] 0

$Neuroblasts
[1] 0

$`Neuroepithelial cells`
[1] 1

$`Non myelinating Schwann cells`
[1] 0

$`Oligodendrocyte precursor cells`
[1] 4

$Oligodendrocytes
[1] 0

$`Radial glial cells`
[1] 5

$`Schwann precursor cells`
[1] 0

$`Serotonergic neurons`
[1] 0

$Tanycytes
[1] 0

$`Cancer cells`
[1] 1

$`Cancer stem cells`
[1] 0
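
To see which marker genes are missing, rather than only how many are present, the count above can be turned into a per-cell-type list. The following is a minimal sketch in base R, not part of sc-type itself. The objects `scaled_genes` and `marker_sets` are hypothetical toy stand-ins for `rownames(scobj[["RNA"]]@scale.data)` and `gs_list$gs_positive`, and the gene symbols are illustrative only, not the contents of ScTypeDB_full.xlsx.

```r
# Toy stand-ins (hypothetical): in the thread these would be
# rownames(scobj[["RNA"]]@scale.data) and gs_list$gs_positive.
scaled_genes <- c("GFAP", "AQP4", "PDGFRA", "CLDN5")
marker_sets <- list(
  Astrocytes = c("GFAP", "AQP4", "SLC1A3"),
  Oligodendrocytes = c("PLP1", "MBP", "MOG"),
  `Oligodendrocyte precursor cells` = c("PDGFRA", "CSPG4")
)

# For each cell type, list the markers absent from the scaled matrix
missing_markers <- lapply(marker_sets, function(genes) setdiff(genes, scaled_genes))

# Fraction of each marker set that is available for scoring
coverage <- vapply(marker_sets, function(genes) mean(genes %in% scaled_genes), numeric(1))

print(missing_markers)
print(round(coverage, 2))
```

A cell type with a coverage of 0 in such a check has no positive markers in the matrix passed to the scoring step, which matches the situation reported for Oligodendrocytes above.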

The quick and dirty solution we used was to run ScaleData again, specifying the features of interest.

# -------------------------------------------------------------------------
# run an ad hoc scaling to include the genes for the cell type annotation
library(Seurat)
library(dplyr)  # provides the %>% pipe

# scale only the sc-type marker genes (positive and negative sets);
# the result goes into a separate object so the original scobj is untouched
scobj_test <- scobj %>%
  ScaleData(
    vars.to.regress = c("percent.mt.harmony", "nCount_RNA.harmony", "S.Score", "G2M.Score", "origin", "facility"),
    verbose = TRUE,
    features = unique(unlist(gs_list))
  )

# check how many features and cells ended up in the scaled matrix
dim(scobj_test@assays$RNA@scale.data)

es.max <- sctype_score(scRNAseqData = scobj_test[["RNA"]]@scale.data, scaled = TRUE,
                       gs = gs_list$gs_positive, gs2 = gs_list$gs_negative)
# -------------------------------------------------------------------------

After this, the pool of marker genes for oligodendrocytes was better represented. For the positive markers:

lapply(gs_list$gs_positive,function(x){
  sum(rownames(scobj_test[["RNA"]]@scale.data) %in% x)
})
$Astrocytes
[1] 15

$`Cholinergic neurons`
[1] 2

$`Dopaminergic neurons`
[1] 8

$`Endothelial cells`
[1] 12

$`GABAergic neurons`
[1] 6

$`Glutamatergic neurons`
[1] 7

$`Immature neurons`
[1] 6

$`Immune system cells`
[1] 9

$`Mature neurons`
[1] 9

$`Microglial cells`
[1] 26

$`Myelinating Schwann cells`
[1] 4

$`Neural Progenitor cells`
[1] 14

$`Neural stem cells`
[1] 4

$Neuroblasts
[1] 6

$`Neuroepithelial cells`
[1] 7

$`Non myelinating Schwann cells`
[1] 4

$`Oligodendrocyte precursor cells`
[1] 6

$Oligodendrocytes
[1] 11

$`Radial glial cells`
[1] 11

$`Schwann precursor cells`
[1] 6

$`Serotonergic neurons`
[1] 4

$Tanycytes
[1] 1

$`Cancer cells`
[1] 3

$`Cancer stem cells`
[1] 6
LASeeker commented 1 year ago

Hi, amazing to hear, @IanevskiAleksandr, that you are working on further improving sctype! I don't think scaling the data would help in my case, because all genes were already represented in the scaled data slot. I also noticed that when I run sctype on a randomly subsetted dataset (same number of nuclei per manually annotated cell type), it usually performs better and detects oligodendrocytes. So I think it is not an oligodendrocyte problem per se but something else. Could it have to do with them being the most abundant cell type in the complete dataset? Interesting, @pedriniedoardo, that you have seen something similar. It would be great to hear from the community if this happens with other cell types, too.
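
For anyone who wants to reproduce the balanced comparison described above, one way to draw the same number of nuclei per manually annotated cell type is sketched below in base R. This is an assumption about how such a subset could be built, not the exact procedure used for the dataset in this thread. The `labels` vector and the cell names are hypothetical toy data.

```r
set.seed(1)  # make the random draw reproducible

# Hypothetical toy metadata: one manual label per nucleus
labels <- c(rep("Oligodendrocytes", 50), rep("Astrocytes", 12), rep("Microglia", 8))
names(labels) <- paste0("cell_", seq_along(labels))

# Use the size of the rarest cell type as the per-type sample size
n_per_type <- min(table(labels))

# Draw that many nuclei, without replacement, from every cell type
cells_by_type <- split(names(labels), labels)
balanced_cells <- unlist(
  lapply(cells_by_type, function(cells) sample(cells, n_per_type)),
  use.names = FALSE
)

print(table(labels[balanced_cells]))
```

With a Seurat object, the selected barcodes could then be passed on as `subset(scobj, cells = balanced_cells)` before rescaling and scoring, so that the full and balanced runs can be compared side by side.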