Nanostring-Biostats / InSituType

An R package for performing cell typing in SMI and other single cell data

refineClusters subclustering logic fails when applied to supervised results #161

Closed patrickjdanaher closed 1 year ago

patrickjdanaher commented 2 years ago

Here's what happens:

1. Designated cell types are subclustered.
2. New logliks are derived for all cells × the subclusters.
3. These new logliks outperform the other cell types' logliks, which were based on reference profiles, not the count data.
4. Tons of cells get reassigned to the subclusters.

patrickjdanaher commented 2 years ago

Basic solution: only sub-cluster the cell types in question; don't revisit the others. For omitted cells, just copy the logliks from the original cluster to its subclusters. E.g., a B-cell will get the same logliks for myeloid_1 and myeloid_2 as it had for "myeloid".
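A minimal sketch of the "copy the logliks" idea, written in plain Python rather than the package's R. The function name `expand_logliks`, the dict-of-dicts layout, and the toy numbers are all hypothetical, chosen only for illustration; this is not the package's implementation.

```python
def expand_logliks(logliks, parent, subclusters, sub_logliks):
    """Replace the `parent` column with one column per subcluster.

    logliks:     {cell_id: {cell_type: loglik}} from the original run
    parent:      name of the cell type being split (e.g. "myeloid")
    subclusters: names of the new subclusters
    sub_logliks: {cell_id: {subcluster: loglik}}, only for cells that
                 belonged to `parent` and were actually subclustered
    """
    expanded = {}
    for cell, row in logliks.items():
        # Keep every column except the one being split.
        new_row = {k: v for k, v in row.items() if k != parent}
        for sub in subclusters:
            if cell in sub_logliks:
                # Subclustered cell: use its freshly derived loglik.
                new_row[sub] = sub_logliks[cell][sub]
            else:
                # Omitted cell: copy the original parent loglik.
                new_row[sub] = row[parent]
        expanded[cell] = new_row
    return expanded


# Toy values, made up for illustration only.
logliks = {
    "bcell_1": {"B-cell": -10.0, "myeloid": -50.0},
    "mye_1": {"B-cell": -60.0, "myeloid": -12.0},
}
sub_logliks = {"mye_1": {"myeloid_1": -9.0, "myeloid_2": -20.0}}
expanded = expand_logliks(
    logliks, "myeloid", ["myeloid_1", "myeloid_2"], sub_logliks
)
```

Here the B-cell keeps its original "myeloid" loglik for both `myeloid_1` and `myeloid_2`, so it has no new reason to leave its cluster.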

Big question: do we use the above logic for supervised only, or for unsupervised results as well?

patrickjdanaher commented 1 year ago

The dilemma:

For supervised / semi-supervised cell typing results, subclustering must only apply to the cell type in question; we can't let the subclusters grab cells that weren't in the original larger cluster.
But for unsupervised cell typing results, it'd be better to allow cells to go wherever they have the best loglikelihood.

Solutions:

1. Somehow track whether a result is supervised/semi-supervised or unsupervised, then have subclustering act appropriately.
2. Just make the subclustering functionality confine itself to the selected cell type, and never reassign cells from other clusters to the new subclusters.

(2) seems easier, both to implement and explain.
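Option (2) could look roughly like the sketch below, in plain Python rather than the package's R. The function name `refine_confined`, the data layout, and the toy numbers are hypothetical and for illustration only.

```python
def refine_confined(labels, sub_logliks, parent):
    """Relabel only cells currently assigned to `parent`.

    labels:      {cell_id: cell_type} before subclustering
    sub_logliks: {cell_id: {subcluster: loglik}}
    parent:      the cell type being split
    """
    new_labels = {}
    for cell, label in labels.items():
        if label == parent:
            # Pick the best-scoring subcluster for this cell.
            row = sub_logliks[cell]
            new_labels[cell] = max(row, key=row.get)
        else:
            # Cells from other clusters are never reassigned.
            new_labels[cell] = label
    return new_labels


# Toy values, made up for illustration only.
labels = {"c1": "myeloid", "c2": "myeloid", "c3": "B-cell"}
sub_logliks = {
    "c1": {"myeloid_1": -8.0, "myeloid_2": -15.0},
    "c2": {"myeloid_1": -14.0, "myeloid_2": -7.0},
    # c3 scores well against myeloid_1 but is not eligible to move.
    "c3": {"myeloid_1": -9.0, "myeloid_2": -30.0},
}
new_labels = refine_confined(labels, sub_logliks, "myeloid")
```

The B-cell `c3` stays put no matter how its subcluster logliks compare with its reference-based one, which is the behavior wanted for supervised and semi-supervised results.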

davidpross commented 1 year ago

I think (2) makes the most sense; in fact, I wouldn't have guessed that sub-clustering grabbed cells from types other than the one being split.

But I don't fully get the explanation for why the sub-clustering grabs all of these other cells. Aren't the logliks updated based on the profiles generated from the actual count data in the end?

patrickjdanaher commented 1 year ago

A (possibly) more complex version:

In refineClusters, add an option to first re-run cell typing using the out$profiles matrix.
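A rough sketch of what "re-cell typing from a profiles matrix" might mean, in plain Python rather than R. It assumes a simple Poisson likelihood with per-cell scaling, which is an assumption made for illustration and not necessarily the likelihood model the package uses. The function names and toy data are hypothetical.

```python
import math


def poisson_loglik(counts, profile):
    """Poisson loglik of one cell's counts under one profile.

    The profile is rescaled so its total matches the cell's total
    counts (an assumed, simplified scaling rule).
    """
    scale = sum(counts) / sum(profile)
    total = 0.0
    for x, p in zip(counts, profile):
        mu = max(scale * p, 1e-12)  # guard against log(0)
        total += x * math.log(mu) - mu - math.lgamma(x + 1)
    return total


def retype(counts_by_cell, profiles):
    """Assign each cell to the profile with the highest loglik."""
    labels = {}
    for cell, counts in counts_by_cell.items():
        scores = {
            name: poisson_loglik(counts, prof)
            for name, prof in profiles.items()
        }
        labels[cell] = max(scores, key=scores.get)
    return labels


# Toy profiles (genes as list positions) and counts, made up for illustration.
profiles = {"B-cell": [5.0, 1.0, 0.5], "myeloid": [0.5, 1.0, 6.0]}
counts = {"c1": [10, 2, 1], "c2": [1, 3, 12]}
retyped = retype(counts, profiles)
```

With every cell type scored against profiles derived the same way, the comparison between the original types and any later subclusters would at least be like for like.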

patrickjdanaher commented 1 year ago

Implemented (2); will merge to ADO soon.