Nanostring-Biostats / FastReseg

FastReseg for detection and correction of cell segmentation error based on spatial profile of transcripts.
https://nanostring-biostats.github.io/FastReseg/
Other
3 stars 0 forks source link

likelihood calcualtion tLLRv2 #16

Open dan11mcguire opened 2 years ago

dan11mcguire commented 2 years ago

@lidanwu @davidpross @patrickjdanaher

Wondering about a few details in FastReseg. It seems like the goal is to get a n_cells x n_clusters matrix of cell-level loglikelihood tLLRv2_cellMatrix <- counts %*% tLLRv2_geneMatrix

But it looks like scoreGenesInRef function seems to calculate the log of the mean profile, instead of a loglikelihood, in producing tLLRv2_geneMatrix.

libsize <- colSums(ref_profiles, na.rm = TRUE) norm_exprs <- Matrix::t(Matrix::t(ref_profiles)/ libsize) _loglik <- log(norm_exprs)_ return(loglik)

I think with the above formula, discrepancies with the original nb_clust derived celltypes and tllRv2_maxCelltype could potentially occur for a couple of reasons

  1. likelihood is not additive by count. i.e., loglik(count=2, mu) is not equal to 2*loglik(count=1, mu)

  2. counts is not normalized by library size like tLLRv2_geneMatrix is.

  3. background not modeled.

If you assume the neg binomial model, maybe this would work below At the least I wonder whether there should be anything but minimum discrepancy between the original cell_type labels from nb_clust, and the celltype from tLLRv2_maxCellType, if the only major difference is background.

(similar to nbclust (R/nbclust.R)) I think Dave R. may have an optimized implementation. ` tLLRv2_cellMatrix_v2 <- apply(refProfiles, 2, function(ref){ libsize <- Matrix::rowSums(counts) yhat <- matrix(libsizes,ncol=1) %*% matrix(ref,nrow=1)

yhat is a n_cells x n_genes matrix of expected values

lls <- stats::dnbinom(x = as.matrix(counts)
                      ,size = size
                      ,mu = yhat
                      ,log = TRUE)
return(rowSums(lls))

}) `

dan11mcguire commented 2 years ago

Also here, in score_transcripts.R.

transcriptGeneScore has one row per transcript. existing looks like below, but we are just adding up the log means in reference profile. `

tmp_score <- transcriptGeneScore[, lapply(.SD, sum), .SDcols = colnames(score_GeneMatrix), by = cellID_coln] `

vs. something like this which would add up the likelihood for the 1-transcript row. (assuming that score_GeneMatrix has not been max-centered already.
The problem with either of these is that we are ignoring 0's.
i.e., the loglikelihood of a 0 count is not 0 tmp_score <- transcriptGeneScore[, lapply(.SD, function(x){ dnbinom(1, mu=exp(x)*libsize,size=10) }) ,.SDcols = colnames(score_GeneMatrix), by = cellID_coln]

dan11mcguire commented 2 years ago

It seems like the end goal is to

  1. get the 1count-transcript level likelihood for each count given the called celltype, i.e., score_df.
  2. test association between transcript location and likelihood of the transcript.

image

But I'm thinking this may be better done using an actual likelihood, as current score can be a noisier kind of approximation, that seems maybe unintentional. You could calculate the likelihood using dnbinom, and still run the same test on association with spatial location.

image

If segmentation errors are more likely to occur near the edge of the cell boundary, would it make sense to test association with distance from cell centroid?

@lidanwu @davidpross @patrickjdanaher @dan11mcguire

lidanwu commented 2 years ago

Thank you, @dan11mcguire, for catching this. You’re right that the tLLRv2 score is not actual log-likelihood. The simple score sum of all transcripts is not the log-likelihood of a cell belonging to certain cell type, but only a quick approximation for the correct log likelihood calculation in order to assign a cell type to individual transcript group being split from the original cell segmentation. Your suggestions are better in assigning cell types at cell level and I probably should do that when it comes to cell typing.

tLLRv2 is meant to be calculated this way. image

It’s the adjusted score for the log-likelihood of a given gene belong to a particular cell type, normalized by subtracting the maximum score of all cell types. The intention is to get a metric for evaluating how unlikely the particular gene belongs to certain cell type. So it’s biased towards to detect genes with negative depletion under current cell type. Since some transcripts might be due to non-specific binding (like the negative probe background), we decide to only flag cells with strong spatial dependency in their tLLRv2 score distribution under the assumption that low score due to cell segmentation would have spatial pattern (local enrichment of low score transcripts) while non-specific binging would be sporadic without spatial pattern.

There are quite many cells overlapping in z direction with various degrees but our cell segmentation borders are only in 2D. Distance to existing 2D borders are not necessarily associated with likelihood of segmentation error. Here is an example of segmentation error due to z-overlay. image

lidanwu commented 2 years ago

Here is a list of functions/steps that might be beneficial from using the log-likelihood of entire profiles for cell typing.

Challenges: how to apply cutoff to say whether the final cell typing is good enough? Is the cutoff affected by transcript number? Current way of calculating baseline is the sum of tLLRv2 score. https://github.com/Nanostring-Biostats/FastReseg/blob/llrecalc/R/get_baseline.R#L235-L282 The cutoff is used here. https://github.com/Nanostring-Biostats/FastReseg/blob/llrecalc/R/wrapper_FastReseg.R#L578

lidanwu commented 5 months ago

implemented in PR #18. The impact on results needs review.