Another modification: we could possibly use squared errors during the iteration stages to speed things up, then switch to negative binomial (NB) likelihood for the final classification step for optimal accuracy (if we even believe NB is better than squared errors).
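For concreteness, here is a minimal sketch (not the actual Ptolemy code) contrasting the two assignment criteria; the function names and the mean/size parameterization of the NB are my assumptions:

```python
import numpy as np
from scipy.stats import nbinom

def assign_squared_error(x, profiles):
    """Assign each cell to the profile minimizing squared error (fast)."""
    # x: (n_cells, n_genes); profiles: (n_clusters, n_genes)
    # Not memory-optimized: builds an (n_cells, n_clusters, n_genes) array.
    d2 = ((x[:, None, :] - profiles[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def assign_negative_binomial(x, profiles, size=10.0):
    """Assign each cell to the profile maximizing NB log-likelihood (slower)."""
    # scipy parameterizes NB by (n, p); convert from mean mu and size n
    # via p = n / (n + mu). The size value here is a placeholder.
    mu = np.clip(profiles, 1e-6, None)        # avoid zero means
    p = size / (size + mu)                    # (n_clusters, n_genes)
    ll = nbinom.logpmf(x[:, None, :], size, p[None, :, :]).sum(axis=2)
    return ll.argmax(axis=1)
```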
A key point: we should use biased subsamples taken via the "geometric sketching" method that Zach Reitz implemented in Ptolemy. This will help us find rare cell types.
Implemented.
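For reference, the published `geosketch` package from the geometric sketching paper is typically invoked like this; whether Ptolemy's internal implementation exposes the same API is an assumption:

```python
import numpy as np
from geosketch import gs  # pip install geosketch

# X_dimred: cells x PCs (e.g., top 50 PCs of the expression matrix);
# random placeholder data stands in for the real embedding here.
X_dimred = np.random.randn(100_000, 50)

# Draw a ~10,000-cell sketch that covers transcriptional space evenly,
# so rare cell types are over-represented relative to uniform sampling.
sketch_index = gs(X_dimred, 10_000, replace=False)
X_subsample = X_dimred[sketch_index]
```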
Background:
I found a nice paper using a similar subsampling approach to ours: https://escholarship.org/content/qt4k17m0n5/qt4k17m0n5_noSplash_4c9d302753657774cbb6b2b0cd4df989.pdf
Theirs is more refined: they draw both "samples" of the data, which are big enough to adequately represent the whole dataset, and "subsamples", which are quite small and which they create many of. The "subsamples" let them explore initial conditions / different local maxima, while the "samples" are big enough to support a full clustering solution.
In contrast, we have just two levels: "subsamples" of ~10,000 cells, and then the full-sized dataset.
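A toy sketch of the two schemes side by side; all sizes and counts here are illustrative assumptions, and uniform sampling stands in for whatever biased sampling each level actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 500_000  # placeholder dataset size

# Their 3-level scheme: many tiny "subsamples" to explore initial
# conditions / local maxima, plus one big "sample" that can support
# a full clustering solution on its own.
their_subsamples = [rng.choice(n_cells, 2_000, replace=False) for _ in range(20)]
their_sample = rng.choice(n_cells, 50_000, replace=False)

# Our 2-level scheme: one ~10,000-cell subsample, then the full dataset.
our_subsample = rng.choice(n_cells, 10_000, replace=False)
our_full = np.arange(n_cells)
```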
They break it down this way:
So perhaps our basic iteration stages should be:
How to implement in the code:
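One possible staging, pulling together the pieces above; the ordering, function names, and iteration count are assumptions rather than the actual implementation:

```python
import numpy as np
from geosketch import gs

def cluster_pipeline(x, x_dimred, n_clusters, n_iter=20):
    """Hypothetical staging: geometric sketch, then fast squared-error
    iterations on the subsample, then one NB pass over all cells.
    assign_squared_error / assign_negative_binomial are the helpers
    sketched in the earlier comment."""
    # Stage 1: biased ~10,000-cell subsample via geometric sketching,
    # which over-represents rare cell types.
    idx = gs(x_dimred, 10_000, replace=False)
    x_sub = x[idx]

    # Stage 2: cheap squared-error updates on the subsample only.
    rng = np.random.default_rng(0)
    profiles = x_sub[rng.choice(len(x_sub), n_clusters, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign_squared_error(x_sub, profiles)
        for k in range(n_clusters):
            members = labels == k
            if members.any():
                profiles[k] = x_sub[members].mean(axis=0)

    # Stage 3: a single, more expensive NB pass over the full dataset.
    return assign_negative_binomial(x, profiles, size=10.0)
```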