Cyclone Computational Efficiency

DarioS commented 4 months ago

I notice the running time is long for typical-sized data sets. Will that be addressed in libscran? It would be good to modernise, if not.

> dim(allHumanFibro)
[1] 21880 34543
> system.time(cycleStages <- cyclone(allHumanFibro, genePairs, gene.names = rowData(allHumanFibro)$ID))
     user    system   elapsed 
12839.273     0.858 13444.817

It took almost four hours. Perhaps it was originally designed for Smart-seq data.

LTLA commented 4 months ago

It was developed even before my time. So, Fluidigm.

It's probably not going to get any faster, because:

1. It's already written in C++. Several years ago I thought there might be some possible algorithmic improvements, but was unable to implement them. I'm probably not going to spend the time to try again, because...
2. I think cell cycle classification is a minor scam. I don't trust the assignments from any method that relies on an ESC reference dataset, because I suspect that the "cell cycle signature" is quite variable across cell types. But putting that aside, I also don't think highly of the way that the cell cycle assignments are used. I see them being applied to regress out the cell cycle effect, which raises all sorts of problems related to linearity assumptions, confounding with cell type, etc.
Nonetheless, if we must get cell cycle phases, you can pull out some reference datasets from scRNAseq with cell cycle information (IIRC BuettnerESCData and LengESCData for mouse and human, respectively), subset them down to cell cycle genes as described in the OSCA book (e.g., filtered on GO:0007049) and use that in your favorite efficient single-cell classification algorithm, e.g., SingleR. This is effectively what cyclone() does anyway. I like this approach because, if nothing else, it makes people think about the validity of using old ESC data to classify their data, rather than sweeping these concerns into the black box of a prebuilt classifier/signature/whatever.
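
As a rough illustration of the "reference plus classifier" idea above, here is a minimal base-R sketch on simulated data. It is a toy stand-in and not SingleR or cyclone() itself: each test cell is assigned to the phase whose median reference profile it correlates with best (Spearman), using only a cell cycle gene set. The gene names, the phase signatures and the simulate() helper are all made up for the example.

```r
# Toy stand-in for reference-based phase classification; NOT SingleR or cyclone().
# All names and values below are hypothetical and exist only for illustration.
set.seed(1)
genes <- paste0("gene", 1:20)
cycle.genes <- genes[1:6]               # pretend these came from a GO:0007049 filter
phases <- c("G1", "S", "G2M")

# Hypothetical phase signatures: the same six genes, peaking at different phases.
signature <- matrix(0, length(genes), length(phases), dimnames = list(genes, phases))
signature[cycle.genes, "G1"]  <- c(5, 4, 3, 2, 1, 0)
signature[cycle.genes, "S"]   <- c(1, 0, 5, 4, 3, 2)
signature[cycle.genes, "G2M"] <- c(3, 2, 1, 0, 5, 4)

# Simulate log-expression for cells with the given phase labels.
simulate <- function(labels) {
    noise <- matrix(rnorm(length(genes) * length(labels), sd = 0.1), nrow = length(genes))
    mat <- signature[, labels, drop = FALSE] + noise
    dimnames(mat) <- list(genes, paste0("cell", seq_along(labels)))
    mat
}

ref.labels <- rep(phases, each = 10)    # labelled reference (stands in for the ESC data)
ref <- simulate(ref.labels)
test.labels <- rep(phases, times = 5)   # truth for the test cells, known only in a simulation
test <- simulate(test.labels)

# Restrict both matrices to the cell cycle genes, then build one median profile per phase.
ref.cc <- ref[cycle.genes, , drop = FALSE]
test.cc <- test[cycle.genes, , drop = FALSE]
profiles <- sapply(phases, function(p) {
    apply(ref.cc[, ref.labels == p, drop = FALSE], 1, median)
})

# Assign each test cell to the phase with the highest Spearman correlation.
scores <- cor(test.cc, profiles, method = "spearman")
assigned <- colnames(scores)[max.col(scores, ties.method = "first")]
table(assigned, truth = test.labels)
```

With real data, ref would be a labelled reference dataset and test your own cells, both restricted to shared cell cycle genes. The simulation makes the signatures identical in reference and test, which is exactly the assumption that is doubtful when classifying other cell types against ESC data.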

DarioS commented 4 months ago

Interesting. My goal is to label cells by <cell type, cell state> and then associate the proportion of each label with chemotherapy response:

<cancer, cycling>
<cancer, not cycling>
<fibroblast mCAF, cycling>
<fibroblast mCAF, not cycling>
<fibroblast iCAF, cycling>
<fibroblast iCAF, not cycling>

I shall do single cell scoring using a different reference. My colleague uses Seurat's cc.genes.updated.2019. Fingers crossed.

LTLA commented 4 months ago

FWIW if you follow the trail of references, I think you will find that Seurat's classifier is based on HeLa data, with some indirect contribution from HEK data. HeLa is a pretty wild system IIRC, barely human at all; though it is pretty popular as a model "organism" for studying cell cycle regulation and mechanisms, so maybe it is still relevant for this purpose. Guess we'll never know; I've never seen any experimental validation of the cell cycle scores.

If you just want cycling/non-cycling, explicit phase assignment is overkill. In fact, I bet this won't even give you proper "non-cycling"; most methods won't have a G0 state in their training data, and I'd be surprised if G1 and G0 were transcriptionally identical. Rather, you may prefer some form of subclustering on each cell type, possibly using only the cell cycle genes, and then manually annotating each subcluster as "cycling" or not. This is the general approach we used for mass cytometry, though the beautiful separation between cycling/non-cycling cells was due to the IdU marker.
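
The subclustering idea can be sketched in base R roughly as follows. This is a toy example under strong assumptions: the data are simulated for a single cell type, only six hypothetical cell cycle genes are used, k-means with two centers stands in for whatever clustering method you prefer, and the "manual annotation" step is replaced by a simple rule (the subcluster with the higher average expression is called cycling).

```r
# Toy sketch: subcluster one cell type on cell cycle genes only, then label subclusters.
# Simulated data; gene count, cell counts and expression levels are hypothetical.
set.seed(2)
n.genes <- 6                            # cell cycle genes only
truth <- rep(c("cycling", "not cycling"), c(15, 25))

# Log-expression matrix (genes x cells): cycling cells sit ~3 units higher on every gene.
expr <- matrix(rnorm(n.genes * length(truth), sd = 0.2), nrow = n.genes)
expr[, truth == "cycling"] <- expr[, truth == "cycling"] + 3

# kmeans() clusters rows, so transpose to get one row per cell.
km <- kmeans(t(expr), centers = 2, nstart = 10)

# Stand-in for manual annotation: the subcluster with the higher mean
# expression of the cell cycle genes is labelled "cycling".
cycling.cluster <- which.max(rowMeans(km$centers))
state <- ifelse(km$cluster == cycling.cluster, "cycling", "not cycling")
table(state, truth)
```

In real data the separation will be far less clean than in this simulation, which is why inspecting marker expression per subcluster before annotating is the safer route.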

Or fancier: do a PCA on all cell cycle genes within the (sub)population of interest, reconstruct the rank-1 matrix from the first PC, take the column sums to obtain a per-cell "cell cycle activity", and then test for differences in the distribution along this axis between conditions. You can check out the ScoreFeatureSet function in libscran for more details.
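
A minimal base-R sketch of the rank-1 scoring idea, on simulated data. This follows the description above literally and is not a reimplementation of libscran's function, whose details (centering, scaling, weighting) may differ. The gene loadings, the hidden activity variable and the two conditions are assumptions made up for the example.

```r
# Toy sketch: per-cell "cell cycle activity" from a rank-1 PCA reconstruction.
# Simulated data; loadings, activity levels and conditions are hypothetical.
set.seed(3)
n.genes <- 8
n.cells <- 40
condition <- rep(c("control", "treated"), each = n.cells / 2)

# A hidden activity level per cell drives all cell cycle genes upwards.
activity.true <- c(rnorm(20, mean = 1, sd = 0.3), rnorm(20, mean = 3, sd = 0.3))
loading <- seq(0.5, 1.5, length.out = n.genes)
expr <- outer(loading, activity.true) +
    matrix(rnorm(n.genes * n.cells, sd = 0.1), nrow = n.genes)

# PCA via SVD of the gene-centered matrix (genes x cells), keeping the first component.
gene.means <- rowMeans(expr)
centered <- expr - gene.means           # a length-n.genes vector recycles down columns
sv <- svd(centered, nu = 1, nv = 1)

# Rank-1 reconstruction, with the gene means added back.
# The product u %*% t(v) does not depend on the arbitrary sign of the singular vectors.
rank1 <- sv$d[1] * sv$u %*% t(sv$v) + gene.means

# Column sums give one activity score per cell.
score <- colSums(rank1)

# Compare the score distribution between conditions.
wt <- wilcox.test(score ~ condition)
wt$p.value
```

A Wilcoxon test on cells is used here only to keep the example short; with real data, cells from the same sample are not independent replicates, so a per-sample summary of the scores is usually the safer unit for testing.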

MarioniLab / scran

Cyclone Computational Efficiency #118