likelihood calcualtion tLLRv2

dan11mcguire commented 2 years ago

@lidanwu @davidpross @patrickjdanaher

Wondering about a few details in FastReseg. It seems like the goal is to get a n_cells x n_clusters matrix of cell-level loglikelihood tLLRv2_cellMatrix <- counts %*% tLLRv2_geneMatrix

But it looks like scoreGenesInRef function seems to calculate the log of the mean profile, instead of a loglikelihood, in producing tLLRv2_geneMatrix.

libsize <- colSums(ref_profiles, na.rm = TRUE) norm_exprs <- Matrix::t(Matrix::t(ref_profiles)/ libsize) _loglik <- log(norm_exprs)_ return(loglik)

I think with the above formula, discrepancies with the original nb_clust derived celltypes and tllRv2_maxCelltype could potentially occur for a couple of reasons

likelihood is not additive by count. i.e., loglik(count=2, mu) is not equal to 2*loglik(count=1, mu)
counts is not normalized by library size like tLLRv2_geneMatrix is.
background not modeled.

If you assume the neg binomial model, maybe this would work below At the least I wonder whether there should be anything but minimum discrepancy between the original cell_type labels from nb_clust, and the celltype from tLLRv2_maxCellType, if the only major difference is background.

(similar to nbclust (R/nbclust.R)) I think Dave R. may have an optimized implementation. ` tLLRv2_cellMatrix_v2 <- apply(refProfiles, 2, function(ref){ libsize <- Matrix::rowSums(counts) yhat <- matrix(libsizes,ncol=1) %*% matrix(ref,nrow=1)

yhat is a n_cells x n_genes matrix of expected values

lls <- stats::dnbinom(x = as.matrix(counts)
                      ,size = size
                      ,mu = yhat
                      ,log = TRUE)
return(rowSums(lls))

}) `

dan11mcguire commented 2 years ago

Also here, in score_transcripts.R.

transcriptGeneScore has one row per transcript. existing looks like below, but we are just adding up the log means in reference profile. `

tmp_score <- transcriptGeneScore[, lapply(.SD, sum), .SDcols = colnames(score_GeneMatrix), by = cellID_coln] `

vs. something like this which would add up the likelihood for the 1-transcript row. (assuming that score_GeneMatrix has not been max-centered already.
The problem with either of these is that we are ignoring 0's.
i.e., the loglikelihood of a 0 count is not 0 tmp_score <- transcriptGeneScore[, lapply(.SD, function(x){ dnbinom(1, mu=exp(x)*libsize,size=10) }) ,.SDcols = colnames(score_GeneMatrix), by = cellID_coln]

dan11mcguire commented 2 years ago

It seems like the end goal is to

get the 1count-transcript level likelihood for each count given the called celltype, i.e., score_df.
test association between transcript location and likelihood of the transcript.

But I'm thinking this may be better done using an actual likelihood, as current score can be a noisier kind of approximation, that seems maybe unintentional. You could calculate the likelihood using dnbinom, and still run the same test on association with spatial location.

If segmentation errors are more likely to occur near the edge of the cell boundary, would it make sense to test association with distance from cell centroid?

@lidanwu @davidpross @patrickjdanaher @dan11mcguire

lidanwu commented 2 years ago

Thank you, @dan11mcguire, for catching this. You’re right that the tLLRv2 score is not actual log-likelihood. The simple score sum of all transcripts is not the log-likelihood of a cell belonging to certain cell type, but only a quick approximation for the correct log likelihood calculation in order to assign a cell type to individual transcript group being split from the original cell segmentation. Your suggestions are better in assigning cell types at cell level and I probably should do that when it comes to cell typing.

tLLRv2 is meant to be calculated this way.

It’s the adjusted score for the log-likelihood of a given gene belong to a particular cell type, normalized by subtracting the maximum score of all cell types. The intention is to get a metric for evaluating how unlikely the particular gene belongs to certain cell type. So it’s biased towards to detect genes with negative depletion under current cell type. Since some transcripts might be due to non-specific binding (like the negative probe background), we decide to only flag cells with strong spatial dependency in their tLLRv2 score distribution under the assumption that low score due to cell segmentation would have spatial pattern (local enrichment of low score transcripts) while non-specific binging would be sporadic without spatial pattern.

There are quite many cells overlapping in z direction with various degrees but our cell segmentation borders are only in 2D. Distance to existing 2D borders are not necessarily associated with likelihood of segmentation error. Here is an example of segmentation error due to z-overlay.

lidanwu commented 2 years ago

Here is a list of functions/steps that might be beneficial from using the log-likelihood of entire profiles for cell typing.

assign cell type based on input reference profiles for original cell segmentation https://github.com/Nanostring-Biostats/FastReseg/blob/llrecalc/R/wrapper_FastReseg.R#L278-L290
assign cell type to individual transcript groups identified by SVM for each flagged cells https://github.com/Nanostring-Biostats/FastReseg/blob/llrecalc/R/wrapper_FastReseg.R#L536-L542
evaluate neighborhood of each transcript groups, check the log-likelihood of current transcript groups under the cell type of neighbor cells and select the cell & cell type with maximum log-likelihood https://github.com/Nanostring-Biostats/FastReseg/blob/llrecalc/R/neighborhood_spatstat.R#L375-L406
assign cell type to the updated cell segmentation https://github.com/Nanostring-Biostats/FastReseg/blob/llrecalc/R/wrapper_FastReseg.R#L631-L635

Challenges: how to apply cutoff to say whether the final cell typing is good enough? Is the cutoff affected by transcript number? Current way of calculating baseline is the sum of tLLRv2 score. https://github.com/Nanostring-Biostats/FastReseg/blob/llrecalc/R/get_baseline.R#L235-L282 The cutoff is used here. https://github.com/Nanostring-Biostats/FastReseg/blob/llrecalc/R/wrapper_FastReseg.R#L578

lidanwu commented 5 months ago

implemented in PR #18. The impact on results needs review.

Nanostring-Biostats / FastReseg

likelihood calcualtion tLLRv2 #16

yhat is a n_cells x n_genes matrix of expected values