refined subsampling approach

patrickjdanaher commented 3 years ago

Background:

I found a nice paper using a subsampling approach similar to ours: https://escholarship.org/content/qt4k17m0n5/qt4k17m0n5_noSplash_4c9d302753657774cbb6b2b0cd4df989.pdf

Theirs is more refined: they use both "samples" of the data, which are big enough to adequately represent the whole dataset, and many small "subsamples". The "subsamples" let them explore initial conditions and different local maxima, while the "samples" are big enough for a full clustering solution.

In contrast, we have just 2 levels: "subsamples" of ~10000, and then the full-sized dataset.

They break it down this way:

So perhaps our basic iteration stages should be:

1. use "subsamples" to choose initial conditions (1k cells, say 10 starts)
2. starting with the estimates from the best "subsample", use a "sample" (e.g. 10k cells) to make progress towards convergence
3. use a large (100k+) "supersample" to complete convergence
4. use the estimates from the "supersample" to classify all cells, i.e. run only 1 iteration of the Mstep and none of the Estep

How to implement in the code:

nbclust is still the basic underlying engine
cellEMclust then has 4 steps:
1. "exploring" using subsamples. Arguments: subsample_size, subsample_n, subsample_maxiter, subsample_convergence_criterion
2. "approaching" using a sample. Arguments: midsample_size, midsample_maxiter, midsample_convergence_criterion
3. "refining" using a supersample. Arguments: supersample_size, supersample_maxiter, supersample_convergence_criterion
4. "classifying" on the full dataset
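
A rough sketch of how the four steps could chain together, written in Python purely for illustration. Everything here is an assumption rather than the package's code: `Stage`, `nbclust_stub` and `cell_em_clust` are hypothetical names, and the stub is a toy 1-D k-means standing in for nbclust so that the control flow can be run end to end. Only the per-stage settings (size, maxiter, convergence criterion, number of subsamples) mirror the arguments listed above.

```python
import random
from dataclasses import dataclass


@dataclass
class Stage:
    size: int                     # number of cells drawn for this stage
    maxiter: int                  # iteration cap for this stage
    convergence_criterion: float  # stop once centers move less than this


def nbclust_stub(cells, centers, maxiter, tol):
    """Hypothetical stand-in for nbclust: 1-D k-means, NOT the real engine."""
    for _ in range(maxiter):
        groups = [[] for _ in centers]
        for x in cells:
            k = min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
            groups[k].append(x)
        # Keep the old center if a cluster received no cells.
        new = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        shift = max(abs(a - b) for a, b in zip(new, centers))
        centers = new
        if shift < tol:
            break
    # Mean squared error per cell, so fits on different subsamples compare fairly.
    mse = sum(min((x - c) ** 2 for c in centers) for x in cells) / len(cells)
    return centers, mse


def cell_em_clust(cells, n_clust, subsample, subsample_n, midsample, supersample, seed=0):
    rng = random.Random(seed)

    def draw(stage):
        return rng.sample(cells, min(stage.size, len(cells)))

    # 1. Exploring: many small subsamples, each with its own random start.
    best_centers, best_mse = None, float("inf")
    for _ in range(subsample_n):
        sub = draw(subsample)
        centers, mse = nbclust_stub(sub, rng.sample(sub, n_clust),
                                    subsample.maxiter, subsample.convergence_criterion)
        if mse < best_mse:
            best_centers, best_mse = centers, mse

    # 2. Approaching: continue from the best start on a mid-sized sample.
    centers, _ = nbclust_stub(draw(midsample), best_centers,
                              midsample.maxiter, midsample.convergence_criterion)

    # 3. Refining: finish convergence on the large supersample.
    centers, _ = nbclust_stub(draw(supersample), centers,
                              supersample.maxiter, supersample.convergence_criterion)

    # 4. Classifying: a single assignment pass over every cell, no re-estimation.
    labels = [min(range(n_clust), key=lambda j: (x - centers[j]) ** 2) for x in cells]
    return centers, labels
```

Each stage starts from the previous stage's estimates, so only the cheap exploring stage pays for multiple starts.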

patrickjdanaher commented 3 years ago

Another modification: possibly, we could use squared errors during the iteration stages to speed things up, then the negative binomial (NB) likelihood in the final classification step for optimal accuracy (if we even believe NB is better than squared errors).
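
To make the trade-off concrete, here is a small Python sketch of the two scoring rules for assigning one cell to a cluster profile. It is an illustration under assumptions, not the package's code: the function names, the fixed `size` (NB dispersion) value, and the scaling of each profile to the cell's total counts are all hypothetical choices. Profiles are assumed to have a positive total.

```python
import math


def nb_loglik(x, mu, size):
    """Log-pmf of a negative binomial with mean `mu` and dispersion `size`."""
    return (math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
            + size * math.log(size / (size + mu))
            + x * math.log(mu / (size + mu)))


def scaled(profile, cell):
    """Scale a cluster profile to the cell's total counts (assumed convention)."""
    s = sum(cell) / sum(profile)
    return [max(s * m, 1e-8) for m in profile]  # floor avoids log(0)


def classify_squared_error(cell, profiles):
    """Cheap rule: pick the profile with the smallest squared error."""
    def err(profile):
        return sum((x - m) ** 2 for x, m in zip(cell, scaled(profile, cell)))
    return min(range(len(profiles)), key=lambda k: err(profiles[k]))


def classify_nb(cell, profiles, size=10.0):
    """Costlier rule: pick the profile with the highest NB log-likelihood."""
    def loglik(profile):
        return sum(nb_loglik(x, m, size) for x, m in zip(cell, scaled(profile, cell)))
    return max(range(len(profiles)), key=lambda k: loglik(profiles[k]))
```

The squared-error rule skips the `lgamma` and `log` calls entirely, which is where the speed-up during iteration would come from.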

patrickjdanaher commented 3 years ago

A key point: we should use biased subsamples taken via the "geometric sketching" method that Zach Reitz implemented in Ptolemy. This will help us find rare cell types.
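
For intuition on why this kind of biased subsample helps with rare cell types, below is a deliberately simplified grid-based sketch in Python. It is not the Ptolemy implementation and not the full geometric sketching algorithm: `grid_sketch` is a hypothetical name, the box width is a fixed parameter I made up for the example, and the points are assumed to already be in a low-dimensional embedding. The idea it illustrates is to sample evenly across occupied regions of the space rather than evenly across cells.

```python
import random
from collections import defaultdict


def grid_sketch(points, n, box_width, seed=0):
    """Return indices of up to `n` points, drawn evenly across occupied grid boxes."""
    rng = random.Random(seed)

    # Assign every point to the grid box that contains it.
    boxes = defaultdict(list)
    for i, p in enumerate(points):
        boxes[tuple(int(c // box_width) for c in p)].append(i)

    pools = list(boxes.values())
    for pool in pools:
        rng.shuffle(pool)

    # Round-robin over boxes: one random point per box per pass,
    # so sparse regions are represented as often as dense ones.
    chosen = []
    while len(chosen) < n and pools:
        rng.shuffle(pools)
        for pool in pools:
            chosen.append(pool.pop())
            if len(chosen) == n:
                break
        pools = [pool for pool in pools if pool]
    return chosen
```

With uniform random sampling, a cell type making up 0.5% of the data would contribute about 0.5% of the subsample; here it contributes in proportion to the number of boxes it occupies.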

patrickjdanaher commented 2 years ago

Implemented.

Nanostring-Biostats / InSituType