semi-supervised final version

patrickjdanaher commented 2 years ago

Basic question: do we use only anchors in defining profiles, or do we use all cells in the anchor clusters?

If the former, then save time by not computing logliks of anchor cells. This will hugely speed up early iters. Can then finally calc anchor logliks when running insitutypeML at the end.
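A rough sketch of the "skip anchor cells" idea, in Python for illustration only (the package itself is R, and none of this is its actual code). The function names, the `size` dispersion value, and the total-count scaling rule are all assumptions made for the example.

```python
import math

def nb_loglik(counts, profile, scale, size=10.0):
    """Negative binomial log-likelihood of one cell under one profile.

    The expected count per gene is scale * profile; `size` is the NB
    dispersion parameter (an assumed value, not taken from the package).
    """
    total = 0.0
    for y, p in zip(counts, profile):
        mu = max(scale * p, 1e-9)  # guard against log(0)
        total += (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
                  + size * math.log(size / (size + mu))
                  + y * math.log(mu / (size + mu)))
    return total

def logliks_skipping_anchors(counts, profiles, anchors):
    """Compute logliks only for cells that are not anchors.

    counts:   list of per-cell count vectors
    profiles: dict mapping cluster name -> expression profile
    anchors:  list where anchors[i] is a cell type name, or None for non-anchors
    Returns a dict: cell index -> {cluster name: loglik}.
    """
    out = {}
    for i, cell in enumerate(counts):
        if anchors[i] is not None:
            continue  # anchor cells keep their label; no loglik needed yet
        cell_total = sum(cell)
        out[i] = {}
        for name, profile in profiles.items():
            # hypothetical scaling: match the cell's total counts to the profile
            scale = cell_total / sum(profile)
            out[i][name] = nb_loglik(cell, profile, scale)
    return out
```

Anchor logliks would then be computed once at the end, by running the same function with `anchors` set to all `None`.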

Also, if we only use anchors to estimate profiles, then they can be totally omitted from the algorithm. And we're back to the old nbclust, where some profiles are just pre-defined and never updated! Do we really want that? Would be more predictable, but then you'd only be able to use cell types with enough anchors to specify a profile, say 500. (Not that anchors work great when you only have 20 of them...)

I.e., are anchors just a better way to pre-define cell type profiles? I.e., to update them for CosMx?

~~Read CallR paper from Wei and Zhang - does semi-sup in scRNAseq~~ Not very relevant.

patrickjdanaher commented 2 years ago

Findings / conclusions

"old" nbclust does much, much better with profiles derived from anchor cells than with the original fixedprofiles.
seems like we could use anchors to partially update fixed profiles in a principled way.

Proceed with workflow reverting to "old" nbclust with "fixedprofiles" argument.

patrickjdanaher commented 2 years ago

What's needed for the fixedprofiles -> anchor -> updatedprofiles workflow:

Function to choose anchors (exists)
Function to update fixedprofiles based on anchors (in progress)
Insitutype() should handle being passed reference_profiles, anchors, OR updated_profiles
in insitutype, anchors should be omitted from subsets until the final step (phase 4 / insitutypeML)
nbclust never updates the fixedprofiles it's passed

Terms:

"reference_profiles" = from scRNAseq, pre-defined
"updated_profiles" = what you get after including info from anchors
"fixed_profiles" = in nbclust, the columns of the profile matrix that never change
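To make the "nbclust never updates the fixedprofiles it's passed" rule concrete, here is a minimal Python sketch of a profile-update step that leaves the fixed columns alone. It is illustrative only: `update_free_profiles` is a hypothetical name, and using the plain mean of assigned cells is an assumption, not the package's estimator.

```python
def update_free_profiles(counts, assignments, profiles, fixed_names):
    """Re-estimate only the non-fixed profiles from current cell assignments.

    counts:      list of per-cell count vectors
    assignments: list of cluster names, one per cell
    profiles:    dict mapping cluster name -> current profile
    fixed_names: set of cluster names whose profiles must never change
    """
    updated = {}
    for name, old in profiles.items():
        if name in fixed_names:
            updated[name] = list(old)  # fixed profile: copied through untouched
            continue
        members = [counts[i] for i, a in enumerate(assignments) if a == name]
        if not members:
            updated[name] = list(old)  # empty cluster: keep the previous profile
            continue
        n_genes = len(old)
        # assumed estimator: per-gene mean over the cells assigned to this cluster
        updated[name] = [sum(c[g] for c in members) / len(members)
                         for g in range(n_genes)]
    return updated
```

In this sketch the fixed profiles still take part in assigning cells; they are just skipped when profiles are re-estimated.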

patrickjdanaher commented 2 years ago

Status:

Done with nbclust
Next: get cluster updating logic implemented in runInsitutype

patrickjdanaher commented 2 years ago

Ways to get fixedprofiles:

Put them in, and don't update
Start with anchors, then derive profiles (seems like you'd do this ahead of time)
Start with ref profiles, then get anchors automatically, then update ref profiles to get fixedprofiles

Need a function: update_reference_profiles.

would take in counts, neg, and optionally anchors
if no anchors, then would auto-select
then would call a subsidiary function: "update_ref_given_anchors"
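One possible shape for this pair of functions, sketched in Python for illustration (the real functions would be R). The cosine-similarity anchor rule, the `min_cosine` and `min_anchors` thresholds, the `choose_anchors` name, and the "mean of background-subtracted anchor counts" update are all assumptions; only the overall structure (optional anchors, auto-select if missing, then delegate to `update_ref_given_anchors`) comes from the notes above.

```python
import math

def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def choose_anchors(counts, neg, ref_profiles, min_cosine=0.9):
    """Hypothetical auto-selection: a cell is an anchor for its best-matching
    reference profile if the cosine similarity reaches `min_cosine`."""
    anchors = []
    for cell, bg in zip(counts, neg):
        net = [max(y - bg, 0.0) for y in cell]  # subtract per-cell background
        best_name, best_sim = None, min_cosine
        for name, prof in ref_profiles.items():
            sim = _cosine(net, prof)
            if sim >= best_sim:
                best_name, best_sim = name, sim
        anchors.append(best_name)
    return anchors

def update_ref_given_anchors(counts, neg, ref_profiles, anchors, min_anchors=2):
    """Replace a reference profile with the mean background-subtracted
    expression of its anchors, if it has at least `min_anchors` of them."""
    updated = {}
    for name, prof in ref_profiles.items():
        cells = [[max(y - bg, 0.0) for y in cell]
                 for cell, bg, a in zip(counts, neg, anchors) if a == name]
        if len(cells) < min_anchors:
            updated[name] = list(prof)  # too few anchors: keep the reference
        else:
            updated[name] = [sum(c[g] for c in cells) / len(cells)
                             for g in range(len(prof))]
    return updated

def update_reference_profiles(counts, neg, ref_profiles, anchors=None):
    """Wrapper: auto-select anchors when none are supplied, then update."""
    if anchors is None:
        anchors = choose_anchors(counts, neg, ref_profiles)
    updated = update_ref_given_anchors(counts, neg, ref_profiles, anchors)
    return updated, anchors  # anchors are returned so callers can report them
```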

Calling in wrapper fns:

runinsitutype would support the main use case where you just put in ref profiles and automatically update, OR the case where you're committed to fixed profiles.
logic would probably have to be duplicated in insitutype AND runinsitutype? Or would runinsitutype just

Functions:

update_profiles - either take pre-specified anchors or derive automatically

patrickjdanaher commented 2 years ago

Status: partway thru updating runinsitutype (line 118). Next: continue stripping references to anchors, add in fixed_profiles references when needed

patrickjdanaher commented 2 years ago

[x] chooseClusterNumber: use fixedprofiles and not anchors
[x] what do we do with the anchor correction step in runinsituType?
[x] need to align counts and fixed_profiles in runinsitutype (see comment in line 77)
[x] ~~remove anchors from sketchingdata~~ (too complex, and only a tiny efficiency gain)
[x] should report anchors if update_reference_profiles gets run
[x] there's a bug if n_clusts = 0: "Error in profiles[, setdiff(colnames(profiles), colnames(fixed_profiles)), : incorrect number of dimensions"
[x] Now ready to test runinsitutype with update_reference_profiles = FALSE
[x] Then need to finalize the update_reference_profiles = TRUE code

patrickjdanaher commented 2 years ago

Next: test that the new version works in multiple datasets. Finding: some stealing from anchor clusters still occurs. It's a somewhat small effect, but still obviously wrong.

Next: how do you slightly update the anchor-based profiles without letting them run away? Solution to try:

[x] new argument: update_reference_profiles_iteratively. if TRUE, then set fixed_profiles = NULL in phase 2 or 3 or both.
[x] then test again in showcase and Ting. That also failed.

patrickjdanaher commented 2 years ago

Best guess at how to optimize semi-sup: make anchor selection less biased towards the cells with the most extreme agreements with the reference. E.g. just take all cells meeting criteria. (But that alone won't accomplish it.)
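The difference between the two selection rules, as a small Python sketch (illustrative only; `select_anchor_cells`, the per-cell agreement `scores`, and the `threshold` are hypothetical stand-ins for whatever criteria the anchor selection actually uses):

```python
def select_anchor_cells(scores, threshold, n_max=None):
    """Pick anchor cells for one cell type from per-cell agreement scores.

    scores:    list of agreement scores with the reference, one per cell
    threshold: minimum score for a cell to qualify at all
    n_max:     if None, keep every qualifying cell (the less biased option);
               otherwise keep only the n_max highest-scoring qualifying cells
    Returns the selected cell indices in ascending order.
    """
    qualifying = [i for i, s in enumerate(scores) if s >= threshold]
    if n_max is None or len(qualifying) <= n_max:
        return qualifying
    # top-N rule: favors the cells with the most extreme agreement
    top = sorted(qualifying, key=lambda i: scores[i], reverse=True)[:n_max]
    return sorted(top)

scores = [0.95, 0.50, 0.91, 0.99, 0.92]
all_qualifying = select_anchor_cells(scores, threshold=0.9)
most_extreme = select_anchor_cells(scores, threshold=0.9, n_max=2)
```

With `n_max=None` the borderline cells at 0.91 and 0.92 stay in the anchor set, so a profile built from it is pulled less toward the most reference-like cells.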

Nanostring-Biostats / InSituType

semi-supervised final version #142
