Nanostring-Biostats / InSituType

An R package for performing cell typing in SMI and other single cell data

refined subsampling approach #22

Closed patrickjdanaher closed 2 years ago

patrickjdanaher commented 3 years ago

Background:

I found a nice paper that uses a subsampling approach similar to ours: https://escholarship.org/content/qt4k17m0n5/qt4k17m0n5_noSplash_4c9d302753657774cbb6b2b0cd4df989.pdf

Theirs is more refined. They have both "samples" of the data, which are big enough to adequately represent the whole dataset, and "subsamples", which are small and which they draw many of. The "subsamples" let them explore different initial conditions and local maxima, and the "samples" are big enough for a full clustering solution.

In contrast, we have just 2 levels: "subsamples" of ~10,000 cells, and then the full-sized dataset.

They break it down this way: [image not preserved]

So perhaps our basic iteration stages should be:

  1. Use "subsamples" to choose initial conditions (1k cells each, say 10 random starts).
  2. Starting from the estimates of the best "subsample", use a "sample" (e.g. 10k cells) to make progress towards convergence.
  3. Use a large (100k+ cells) "supersample" to complete convergence.
  4. Use the estimates from the "supersample" to classify all cells. I.e. only run 1 iteration of the Mstep, none of the Estep.
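
To make the four stages concrete, here is a minimal sketch of the schedule. It is written in Python with only the standard library, purely for illustration, and it is not the package's code. It swaps in plain nearest-center (k-means style) updates for the real clustering model, and every name in it (`staged_clustering`, `fit`, `n_sub`, `n_sample`, `n_super`) is hypothetical. One simplification to note: each start is scored on its own subsample, which is only a rough way to compare starts.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(cells, centers):
    """Label each cell with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda k: sq_dist(cell, centers[k]))
            for cell in cells]

def update(cells, labels, centers):
    """Recompute each center as the mean of its cells; empty clusters keep their old center."""
    new = []
    for k, old in enumerate(centers):
        members = [c for c, lab in zip(cells, labels) if lab == k]
        new.append(tuple(sum(col) / len(members) for col in zip(*members)) if members else old)
    return new

def cost(cells, centers):
    """Total squared distance from each cell to its nearest center."""
    return sum(min(sq_dist(cell, ctr) for ctr in centers) for cell in cells)

def fit(cells, centers, n_iter=5):
    """Run a fixed number of assign/update iterations from the given centers."""
    for _ in range(n_iter):
        centers = update(cells, assign(cells, centers), centers)
    return centers

def staged_clustering(cells, k, rng, n_starts=10, n_sub=1000,
                      n_sample=10000, n_super=100000):
    def take(n):
        return rng.sample(cells, min(n, len(cells)))

    # Stage 1: many small subsamples, one random start each; keep the best.
    best_cost, best_centers = None, None
    for _ in range(n_starts):
        sub = take(n_sub)
        centers = fit(sub, rng.sample(sub, k))
        c = cost(sub, centers)
        if best_cost is None or c < best_cost:
            best_cost, best_centers = c, centers

    # Stage 2: refine the best start on a medium-sized sample.
    centers = fit(take(n_sample), best_centers)
    # Stage 3: finish convergence on a large supersample.
    centers = fit(take(n_super), centers)
    # Stage 4: a single classification pass over every cell, no further updates.
    return centers, assign(cells, centers)
```

The point of the structure is that the expensive pass over the full dataset happens exactly once, at the very end, and all exploration of starting points is done on the cheapest subsamples.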

How to implement in the code:

patrickjdanaher commented 3 years ago

Another modification: we could possibly use squared errors during the iteration stages to speed things up, then the negative binomial (NB) likelihood in the final classification step for optimal accuracy (if we even believe NB is better than squared errors).
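
A rough sketch of what swapping the scoring rule could look like, in standard-library Python for illustration only. The function names, the `size` (NB dispersion) value, and the small background term `bg` are all assumptions made for this example and are not taken from the package. Both scorers share one interface, so the iteration stages could call the cheap one and the final pass the NB one.

```python
import math

def nb_loglik(y, mu, size=10.0):
    """Log-probability of count y under a negative binomial with mean mu and dispersion size."""
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(size / (size + mu))
            + y * math.log(mu / (size + mu)))

def expected_counts(profile, total, bg=0.01):
    """Scale a cluster profile to the cell's total counts; bg keeps every mean positive."""
    s = sum(profile)
    return [total * p / s + bg for p in profile]

def score_sq(counts, profile):
    """Fast score: negative sum of squared errors (higher is better)."""
    mu = expected_counts(profile, sum(counts))
    return -sum((y - m) ** 2 for y, m in zip(counts, mu))

def score_nb(counts, profile, size=10.0):
    """Slower score: negative binomial log-likelihood (higher is better)."""
    mu = expected_counts(profile, sum(counts))
    return sum(nb_loglik(y, m, size) for y, m in zip(counts, mu))

def classify(counts, profiles, score):
    """Return the index of the best-scoring cluster profile for one cell."""
    return max(range(len(profiles)), key=lambda k: score(counts, profiles[k]))
```

Usage would be `classify(cell, profiles, score_sq)` inside the iteration stages and `classify(cell, profiles, score_nb)` in the final pass. Whether the two rules disagree often enough to matter is an empirical question this sketch does not answer.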

patrickjdanaher commented 3 years ago

A key point: we should use biased subsamples taken via the "geometric sketching" method that Zach Reitz implemented in Ptolemy. This will help us find rare cell types.
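
For intuition about why a biased subsample helps with rare cell types, here is a deliberately simplified grid-based sketch in standard-library Python. It is not the Ptolemy implementation and not the published geometric sketching algorithm, which chooses the box size adaptively. Here the box width is a fixed, hypothetical parameter, and `grid_sketch` is a made-up name. The idea it illustrates: cover the data with equal-sized boxes and draw cells evenly across occupied boxes, so sparse regions are sampled as often as dense ones.

```python
import random
from collections import defaultdict

def grid_sketch(points, n, box_width, rng):
    """Return indices of up to n points, drawn round-robin across occupied grid boxes."""
    # Bin every point into an axis-aligned box of side box_width.
    boxes = defaultdict(list)
    for i, p in enumerate(points):
        boxes[tuple(int(x // box_width) for x in p)].append(i)
    # Shuffle within each box so that pop() removes a random member.
    for members in boxes.values():
        rng.shuffle(members)

    chosen = []
    occupied = list(boxes.values())
    while len(chosen) < n and occupied:
        rng.shuffle(occupied)  # avoid always favouring the same boxes
        for members in occupied:
            if len(chosen) == n:
                break
            chosen.append(members.pop())
        # Drop boxes that have run out of points.
        occupied = [m for m in occupied if m]
    return chosen
```

With 990 cells packed into one box and 10 rare cells in another, a sketch of 20 cells takes 10 from each box, whereas a uniform random draw of 20 would be expected to include almost none of the rare cells.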

patrickjdanaher commented 2 years ago

Implemented.