incorporating spatial information (future option)

patrickjdanaher commented 2 years ago

Spatial context carries information that can help cell typing. (Though it is often better to ignore this information if you'll be running DE vs. spatial variables downstream.)

Proposal to incorporate spatial information:

1. Run cell typing (or just the Mstep).
2. Calculate a "neighborhood matrix" describing the cell type abundance in the vicinity of each cell.
3. Use this neighborhood matrix to group similar cells together. Mclust would be very fast, but something continuous like Euclidean distance would be better. Perhaps score by the distance from each cluster's centroid, or by the loglik under each cluster?
4. For each cell, use its neighborhood information to calculate the prior probability of it being each cell type. If (3) just outputs clusters, then the prior probability is the cell type frequencies in the cluster. If (3) outputs continuous distances, then use them to derive a weighted average of cell type frequencies.
5. Floor these prior probabilities at 1e-3 or 1e-2 so they don't rule out any cell types. The floor can be lower for big bins, but should be higher in small bins where cell type frequencies are noisy. Alternatively, just add 10 counts to all cluster frequencies within each bin. ~~Even better, average the cluster freqs with the base rates.~~ Or add 1000 cells' worth of counts from the baseline frequencies.
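
A rough sketch of steps 2 to 5 in base R, to make the proposal concrete. Everything here is an assumption made for illustration: the function name `neighborhood_priors`, its arguments, the brute-force neighbor search, and the use of `kmeans` as a stand-in for Mclust. It takes the "add pseudo-counts from the baseline frequencies" option in step 5 and is not the InSituType implementation.

```r
# Hypothetical sketch of steps 2-5; not the InSituType implementation.
neighborhood_priors <- function(xy, celltype, k = 10, n_bins = 3,
                                pseudo_cells = 1000, floor_prob = 1e-3) {
  celltype <- factor(celltype)
  n <- nrow(xy)

  # Brute-force k nearest neighbors in xy space (O(n^2); toy data only).
  d <- as.matrix(dist(xy))
  diag(d) <- Inf
  nn <- matrix(unlist(lapply(seq_len(n), function(i) order(d[i, ])[seq_len(k)])),
               nrow = n, byrow = TRUE)

  # Step 2: neighborhood matrix = fraction of each cell type among the k neighbors.
  onehot <- outer(as.integer(celltype), seq_len(nlevels(celltype)), "==") * 1
  colnames(onehot) <- levels(celltype)
  nbr <- Reduce(`+`, lapply(seq_len(k),
                            function(j) onehot[nn[, j], , drop = FALSE])) / k

  # Step 3: group cells with similar neighborhoods (k-means as a stand-in).
  bin <- factor(kmeans(nbr, centers = n_bins, nstart = 5)$cluster,
                levels = seq_len(n_bins))

  # Step 4: cell type counts within each bin.
  counts <- unclass(table(bin, celltype))

  # Step 5: add pseudo_cells' worth of counts at the baseline frequencies,
  # normalize, then floor so that no cell type is ruled out.
  base <- as.numeric(table(celltype)) / n
  prior <- sweep(counts, 2, pseudo_cells * base, "+")
  prior <- prior / rowSums(prior)
  prior <- pmax(prior, floor_prob)
  prior <- prior / rowSums(prior)

  prior[as.integer(bin), , drop = FALSE]  # one row of priors per cell
}

# Toy data: type "A" dominates the left half of the slide, "B" the right half.
set.seed(1)
n <- 300
xy <- cbind(x = runif(n), y = runif(n))
p_a <- ifelse(xy[, "x"] < 0.5, 0.8, 0.2)
celltype <- ifelse(runif(n) < p_a, "A", "B")
priors <- neighborhood_priors(xy, celltype, pseudo_cells = 50)
```

The returned matrix has one row per cell and one column per cell type; its log could be added to the RNA-based logliks as a tiebreaker.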

The impact of the above will be modest. It won't reshape the whole cell typing outcome, but it will act as a tiebreaker for more ambiguous cells.

patrickjdanaher commented 2 years ago

Note that this approach could also allow for incorporation of immunofluorescence info.

patrickjdanaher commented 2 years ago

Basic approach:

- Helper functions to quickly split cells into "cohorts" based on info outside of RNA, e.g. neighborhood info and immunofluorescence.
- When clustering, compute cell logliks as their RNA-based loglik + the log cluster frequency within their cohort.
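
Written out, with notation introduced here only for illustration (cell $i$, cell type $k$, cohort $c(i)$ of cell $i$, and $\pi_{c,k}$ the frequency of type $k$ within cohort $c$), the adjusted score would be roughly:

$$
\ell^{\text{adj}}_{ik} = \ell^{\text{RNA}}_{ik} + \log \pi_{c(i),k}
$$

This is Bayes' rule in log space: the cohort frequency plays the role of the prior, the RNA-based loglik is the likelihood, and the sum is an unnormalized log posterior.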

Function 1: "fastNeighborhoodFactorization":

1. Calculate a matrix of nearest neighbors in xy space, or just take it as input.
2. Calculate or take in a PCA dim reduction of the gene expression matrix (do this to save memory usage vs. 1000 genes).
3. Calculate the average PCA position of each cell's neighbors. This produces a ~20 column "neighborhood matrix".
4. Run PCA on this matrix to further dim-reduce it, probably to 4 dimensions.
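
The four steps above could look something like the following in base R. This is only a sketch under stated assumptions: the name `neighborhood_factorization_sketch`, its arguments, the `log1p` transform before PCA, and the brute-force neighbor search are all choices made for the example, and the code in the branch may differ.

```r
# Hypothetical sketch; names, defaults and the log1p transform are assumptions.
neighborhood_factorization_sketch <- function(xy, counts, k = 10,
                                              n_expr_pcs = 20, n_out = 4) {
  n <- nrow(xy)

  # 1. k nearest neighbors in xy space (brute force, O(n^2); toy data only).
  d <- as.matrix(dist(xy))
  diag(d) <- Inf
  nn <- matrix(unlist(lapply(seq_len(n), function(i) order(d[i, ])[seq_len(k)])),
               nrow = n, byrow = TRUE)

  # 2. PCA of the expression matrix, keeping n_expr_pcs components.
  expr_pcs <- prcomp(log1p(counts), center = TRUE, rank. = n_expr_pcs)$x

  # 3. Average PCA position of each cell's neighbors (n x n_expr_pcs).
  nbr <- Reduce(`+`, lapply(seq_len(k),
                            function(j) expr_pcs[nn[, j], , drop = FALSE])) / k

  # 4. Second PCA to reduce the neighborhood matrix to n_out dimensions.
  prcomp(nbr, center = TRUE, rank. = n_out)$x
}

# Toy data: 200 cells, 50 genes, random positions.
set.seed(1)
n <- 200
xy <- cbind(runif(n), runif(n))
counts <- matrix(rpois(n * 50, lambda = 5), nrow = n)
nbr_pcs <- neighborhood_factorization_sketch(xy, counts)
```

For real data sets the `dist()` call would need to be replaced by an approximate or tree-based neighbor search, since it stores all pairwise distances.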

Function 2: "fastCohorting"

1. Take in a matrix of relevant data. Main use case: cbind(the above neighborhood info plus the immunofluorescence columns).
2. Quickly cluster it, either with Mclust or sketching.
3. Return the cluster results, called "cohort".
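
A minimal sketch of the cohorting step. To stay self-contained it swaps in `kmeans` from base R where the steps above mention Mclust or sketching; the function name `cohorting_sketch`, the column scaling, and the immunofluorescence column names in the toy data are assumptions made for the example.

```r
# Hypothetical sketch: k-means stands in for Mclust / sketching.
cohorting_sketch <- function(mat, n_cohorts = 5) {
  # Put neighborhood PCs and immunofluorescence columns on a common scale.
  # (Constant columns would become NaN here and should be dropped first.)
  mat <- scale(as.matrix(mat))
  km <- kmeans(mat, centers = n_cohorts, nstart = 5, iter.max = 50)
  paste0("cohort_", km$cluster)
}

# Toy data: 4 neighborhood dimensions plus two made-up IF intensity columns.
set.seed(1)
n <- 200
nbr <- matrix(rnorm(n * 4), nrow = n)
immunofluor <- cbind(CD45 = rexp(n), PanCK = rexp(n))
cohort <- cohorting_sketch(cbind(nbr, immunofluor))
```

The result is one cohort label per cell, which is the only thing the loglik update needs from this step.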

Function 3: "update_logliks_with_cohort_freqs"

1. Take in a per-cell logliks matrix, the vector of cluster assignments, and the vector of cohort assignments.
2. For each cohort:
   - Get the frequency of each cell type within the cohort.
   - Use Bayesian math to shrink those frequencies with the baseline cell type frequencies. Also apply a lower threshold of 1e-3 or so.
   - For all cells in the cohort, `sweep(logliks[thesecells, ], 2, log(cohortfrequencies), "+")`.
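
One possible reading of these steps, as a sketch rather than the merged code. The shrinkage is done by adding `pseudo_cells` worth of counts at the baseline frequencies, which is just one way to interpret "Bayesian math"; the function name `update_logliks_sketch` and the default values are assumptions.

```r
# Hypothetical sketch; the pseudo-count shrinkage and defaults are assumptions.
# logliks: cells x cell types matrix with cell type names as column names.
update_logliks_sketch <- function(logliks, clust, cohort,
                                  pseudo_cells = 100, min_freq = 1e-3) {
  clust <- factor(clust, levels = colnames(logliks))
  # Baseline cell type frequencies over all cells.
  base <- as.numeric(table(clust)) / length(clust)

  for (co in unique(cohort)) {
    these <- which(cohort == co)
    counts <- as.numeric(table(clust[these]))
    # Shrink cohort frequencies toward the baseline with pseudo-counts.
    freqs <- (counts + pseudo_cells * base) / (length(these) + pseudo_cells)
    # Lower threshold so no cell type is ruled out, then renormalize.
    freqs <- pmax(freqs, min_freq)
    freqs <- freqs / sum(freqs)
    # Add the log prior to every cell in the cohort.
    logliks[these, ] <- sweep(logliks[these, , drop = FALSE], 2, log(freqs), "+")
  }
  logliks
}

# Toy data: uninformative RNA logliks, two cohorts with opposite composition.
logliks <- matrix(0, nrow = 6, ncol = 2, dimnames = list(NULL, c("A", "B")))
clust <- c("A", "A", "A", "B", "B", "B")
cohort <- c(1, 1, 1, 2, 2, 2)
out <- update_logliks_sketch(logliks, clust, cohort)
```

With flat RNA logliks, the cohort frequencies alone decide the outcome: cells in cohort 1 lean toward "A" and cells in cohort 2 toward "B", which is the tiebreaker behavior described earlier in the thread.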

patrickjdanaher commented 2 years ago

Implemented in the iss132-cohorting branch. Not yet merged.

patrickjdanaher commented 2 years ago

merged

Nanostring-Biostats / InSituType #132