MarioniLab / scran

Clone of the Bioconductor repository for the scran package.
https://bioconductor.org/packages/devel/bioc/html/scran.html

Cyclone Computational Efficiency #118

Closed DarioS closed 4 months ago

DarioS commented 4 months ago

I notice the running time is long for typical-sized data sets. Will that be addressed in libscran? It would be good to modernise, if not.

> dim(allHumanFibro)
[1] 21880 34543
> system.time(cycleStages <- cyclone(allHumanFibro, genePairs, gene.names = rowData(allHumanFibro)$ID))
     user    system   elapsed 
12839.273     0.858 13444.817

It took almost four hours. Perhaps it was originally designed for Smart-seq data.

LTLA commented 4 months ago

It was developed even before my time. So, Fluidigm.

It's probably not going to get any faster, because:

DarioS commented 4 months ago

Interesting. My goal is to label cells by <cell type, cell state> and then associate proportions to chemotherapy response.

<cancer, cycling>
<cancer, not cycling>
<fibroblast mCAF, cycling>
<fibroblast mCAF, not cycling>
<fibroblast iCAF, cycling>
<fibroblast iCAF, not cycling>

I shall do single cell scoring using a different reference. My colleague uses Seurat's cc.genes.updated.2019. Fingers crossed.

LTLA commented 4 months ago

FWIW if you follow the trail of references, I think you will find that Seurat's classifier is based on HeLa data, with some indirect contribution from HEK data. HeLa is a pretty wild system IIRC, barely human at all, though it is popular as a model "organism" for studying cell cycle regulation and mechanisms, so maybe it is still relevant for this purpose. Guess we'll never know; I've never seen any experimental validation of the cell cycle scores.

If you just want cycling/non-cycling, explicit phase assignment is overkill. In fact, I bet this won't even give you proper "non-cycling"; most methods won't have a G0 state in their training data, and I'd be surprised if G1 and G0 were transcriptionally identical. Rather, you may prefer some form of subclustering on each cell type, possibly using only the cell cycle genes, and then manually annotating each subcluster as "cycling" or not. This is the general approach we used for mass cytometry, though the beautiful separation between cycling/non-cycling cells was due to the IdU marker.
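A minimal base-R sketch of the subclustering idea, under stated assumptions: `subcluster_cycle` is a hypothetical helper (not part of scran), the input is a log-normalized genes x cells matrix already restricted to one cell type, and k-means on a few PCs stands in for whatever clustering method you prefer. The toy matrix is simulated purely for illustration.

```r
# Hypothetical helper (not part of scran): subcluster one cell type using
# only cell cycle genes. 'logcounts' is a genes x cells matrix.
subcluster_cycle <- function(logcounts, cycle_genes, k = 2, n_pcs = 5) {
    x <- logcounts[intersect(cycle_genes, rownames(logcounts)), , drop = FALSE]
    pcs <- prcomp(t(x), center = TRUE, scale. = FALSE)$x   # cells x PCs
    pcs <- pcs[, seq_len(min(n_pcs, ncol(pcs))), drop = FALSE]
    cl <- kmeans(pcs, centers = k, nstart = 10)$cluster
    # Mean cell cycle gene expression per subcluster, to guide manual annotation.
    list(cluster = cl, mean_expr = tapply(colMeans(x), cl, mean))
}

# Simulated stand-in for one cell type: the first 50 cells have elevated
# expression of the (made-up) cell cycle genes.
set.seed(42)
truth <- rep(c("cycling", "other"), each = 50)
toy <- matrix(rnorm(20 * 100, sd = 0.5), nrow = 20,
              dimnames = list(paste0("gene", 1:20), paste0("cell", 1:100)))
toy[, truth == "cycling"] <- toy[, truth == "cycling"] + 2

res <- subcluster_cycle(toy, rownames(toy))
res$mean_expr              # inspect, then label each subcluster by hand
table(res$cluster, truth)  # only possible here because the truth is simulated
```

The helper deliberately returns cluster numbers rather than labels; deciding which subcluster counts as "cycling" stays a manual step, as described above.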

Or, fancier: do a PCA on all cell cycle genes within the (sub)population of interest, reconstruct the rank-1 approximation of the expression matrix, take the column sums to obtain a "cell cycle activity" per cell, and then test for differences in the distribution along this axis between conditions. You can check out the ScoreFeatureSet function in libscran for more details.
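The rank-1 approach can be sketched in base R as follows. This is an illustration under assumptions, not a port of libscran's implementation: `rank1_activity` is a hypothetical helper, the input is a log-normalized genes x cells matrix for one (sub)population, the data and condition labels are simulated, and a Wilcoxon rank-sum test is just one possible way to compare the distributions.

```r
# Hypothetical helper: per-cell "activity" from a rank-1 reconstruction.
rank1_activity <- function(logcounts, genes) {
    x <- logcounts[intersect(genes, rownames(logcounts)), , drop = FALSE]
    centered <- x - rowMeans(x)            # center each gene across cells
    s <- svd(centered, nu = 1, nv = 1)     # leading singular vectors only
    rank1 <- s$d[1] * (s$u %*% t(s$v))     # rank-1 reconstruction (genes x cells)
    score <- colSums(rank1)                # one value per cell
    names(score) <- colnames(x)
    score
}

# Simulated example: condition A has 40/50 cycling cells, condition B has 10/50.
set.seed(1)
is_cycling <- c(rep(TRUE, 40), rep(FALSE, 10), rep(TRUE, 10), rep(FALSE, 40))
condition <- rep(c("A", "B"), each = 50)
toy <- matrix(rnorm(20 * 100, sd = 0.5), nrow = 20,
              dimnames = list(paste0("gene", 1:20), paste0("cell", 1:100)))
toy[, is_cycling] <- toy[, is_cycling] + 1.5

activity <- rank1_activity(toy, rownames(toy))
# Compare the distribution of activity scores between the two conditions.
wt <- wilcox.test(activity ~ condition)
wt$p.value
```

Because the genes are centered before the decomposition, the scores sum to zero across cells; they are relative within the (sub)population and are not comparable across separately scored populations.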