Nanostring-Biostats / InSituType

An R package for performing cell typing in SMI and other single cell data

incorporating spatial information (future option) #132

Closed patrickjdanaher closed 2 years ago

patrickjdanaher commented 2 years ago

Spatial context carries information that can help cell typing. (Though it's often better to ignore this information if you'll later be running differential expression (DE) against spatial variables.)

Proposal to incorporate spatial information:

  1. Run cell typing (or just the Mstep)
  2. Calculate a "neighborhood matrix" describing the cell type abundance in the vicinity of each cell
  3. Use this neighborhood matrix to group similar cells together. Mclust would be very fast, but something continuous like Euclidean distance would be better. Perhaps score each cell by its distance from each cluster's centroid, or by its loglik under each cluster?
  4. For each cell, use its neighborhood information to calculate the prior probability of it being each cell type. If (3) just outputs clusters, then the prior probability is the cell type frequencies in the cell's cluster. If (3) outputs continuous distances, then use them to derive a weighted average of cell type frequencies.
  5. Floor these prior probabilities at 1e-3 or 1e-2 so they never rule out a cell type entirely. The floor can be lower for big bins, but should be higher for small bins, where cell type frequencies are noisy. Alternatively, just add 10 counts to all cluster frequencies within each bin. Even better, average the cluster frequencies with the base rates, or add 1000 cells' worth of counts from the baseline frequencies.
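A rough sketch of steps 2, 4 and 5, written in plain Python for illustration only (it is not the package's R code). It assumes a brute-force radius search and skips the grouping in step 3, treating each cell's own neighborhood as its bin. The names `neighborhood_matrix`, `smoothed_prior`, `pseudo_cells` and `floor` are hypothetical.

```python
from math import hypot

def neighborhood_matrix(xy, cell_types, type_names, radius):
    """Count, for each cell, its neighbors of each type within `radius`.

    Brute force O(n^2); a real implementation would need a spatial index.
    """
    col = {t: j for j, t in enumerate(type_names)}
    mat = [[0] * len(type_names) for _ in xy]
    for i, (xi, yi) in enumerate(xy):
        for k, (xk, yk) in enumerate(xy):
            if i != k and hypot(xi - xk, yi - yk) <= radius:
                mat[i][col[cell_types[k]]] += 1
    return mat

def smoothed_prior(counts, base_rates, pseudo_cells=1000.0, floor=1e-3):
    """Add `pseudo_cells` worth of baseline counts, floor, then renormalize."""
    total = sum(counts) + pseudo_cells
    prior = [(c + pseudo_cells * b) / total for c, b in zip(counts, base_rates)]
    prior = [max(p, floor) for p in prior]  # never rule a cell type out
    s = sum(prior)
    return [p / s for p in prior]

# Toy data: three cells close together, two tumor cells far away.
xy = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
cell_types = ["T", "T", "B", "tumor", "tumor"]
type_names = ["B", "T", "tumor"]

nbr = neighborhood_matrix(xy, cell_types, type_names, radius=2.0)
base_rates = [cell_types.count(t) / len(cell_types) for t in type_names]
# Small pseudo-count here only because the toy dataset is tiny.
prior0 = smoothed_prior(nbr[0], base_rates, pseudo_cells=2.0)
```

With a large `pseudo_cells` the prior stays close to the base rates in small, noisy bins and only moves away from them when a bin contains many cells.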

The impact of the above will be modest. It won't reshape the whole cell typing outcome, but it will act as a tiebreaker for more ambiguous cells.

patrickjdanaher commented 2 years ago

Note that this approach could also allow for incorporation of immunofluorescence info.

patrickjdanaher commented 2 years ago

Basic approach:

  1. Helper functions to quickly split cells into "cohorts" based on info outside of RNA, e.g. neighborhood info and immunofluorescence.
  2. When clustering, compute each cell's loglik as its RNA-based loglik plus the log cluster frequency within its cohort.
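The update in item 2 could look something like the following. This is a hedged Python sketch, not the package's implementation: `cohort_log_freqs`, `add_cohort_prior`, `argmax` and the `floor` value are hypothetical, and the logliks are made-up numbers.

```python
from collections import Counter, defaultdict
from math import log

def cohort_log_freqs(cohorts, assignments, cluster_names, floor=1e-3):
    """Log frequency of each cluster within each cohort, floored so that
    no cluster gets a log frequency of -infinity."""
    counts = defaultdict(Counter)
    for cohort, cluster in zip(cohorts, assignments):
        counts[cohort][cluster] += 1
    out = {}
    for cohort, ctr in counts.items():
        n = sum(ctr.values())
        out[cohort] = [log(max(ctr[c] / n, floor)) for c in cluster_names]
    return out

def add_cohort_prior(logliks, cell_cohorts, log_freqs):
    """Per-cell loglik = RNA-based loglik + log cluster frequency in its cohort."""
    return [
        [ll + lf for ll, lf in zip(row, log_freqs[cohort])]
        for row, cohort in zip(logliks, cell_cohorts)
    ]

def argmax(row):
    return max(range(len(row)), key=row.__getitem__)

cluster_names = ["a", "b"]
# Current assignments used to estimate frequencies: cohort x is mostly "a",
# cohort y is entirely "b".
cohorts = ["x", "x", "x", "x", "y", "y"]
assignments = ["a", "a", "a", "b", "b", "b"]
log_freqs = cohort_log_freqs(cohorts, assignments, cluster_names)

# Two cells to score: an ambiguous cell in cohort x, a clear-cut cell in cohort y.
rna_logliks = [[-100.0, -99.9], [-50.0, -80.0]]
cell_cohorts = ["x", "y"]
updated = add_cohort_prior(rna_logliks, cell_cohorts, log_freqs)

before = [cluster_names[argmax(r)] for r in rna_logliks]
after = [cluster_names[argmax(r)] for r in updated]
```

In this toy case the ambiguous cell switches from "b" to "a", the majority cluster in its cohort. The clear-cut cell keeps its RNA-based call even though its cohort contains no "a" cells, because the floor keeps the prior from vetoing a cluster outright.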

Function 1: "fastNeighborhoodFactorization":

Function 2: "fastCohorting"

Function 3: "update_logliks_with_cohort_freqs"

patrickjdanaher commented 2 years ago

Implemented in the iss132-cohorting branch. Not yet merged.

patrickjdanaher commented 2 years ago

Merged.