Question about "test_composition_above_logit_fold_change" parameter in sccomp_glm

Winbuntu commented 11 months ago

Hi sccomp team,

I have a few questions about the "test_composition_above_logit_fold_change" parameter in sccomp_glm(), and I would appreciate your input:

I believe this parameter represents the "fold-change threshold (0.2 by default)" described in the Methods section of the PNAS paper. I think the "fold change" referred to here is the fold change of the logit-transformed cell type proportion, not of the original cell type proportion, which is bounded to [0, 1]. Is that correct?
I hope to identify differentially abundant cell types between two groups of samples, by looking for cell types that 1) are statistically significant and 2) have a proportion fold change greater than 2. My understanding is that test_composition_above_logit_fold_change does not represent the fold change of the cell type proportion (which is what I want), but that of the logit-transformed proportion. So my ad hoc solution is to first set test_composition_above_logit_fold_change = 0, then manually calculate the fold change in proportion between the two groups for each cell type, and select the cell types that have c_FDR < 0.05 and fold change > 2. Does this make sense, or should I use a different approach?

stemangiola commented 11 months ago

Hello @Winbuntu,

I should change that to test_composition_above_inv_softmax_fold_change. A bit of a mouthful, but the inverse-softmax space is the one the parameter is estimated in.

Here are the definitions of softmax and its inverse:

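# Numerically stable log(sum(exp(x)))
logsumexp <- function(x) {
    y <- max(x)
    y + log(sum(exp(x - y)))
}

# Map an unconstrained, sum-to-zero vector to proportions that sum to 1
softmax <- function(x) {
    # Compare with a tolerance: an exact `sum(x) != 0` check can fail on floating-point round-off
    if (abs(sum(x)) > 1e-8) stop("The unconstrained vector should sum to zero, so that it keeps the same degrees of freedom (i.e. length(x) - 1) as the proportion vector, which sums to 1.")
    exp(x - logsumexp(x))
}

# Map proportions back to the unconstrained, sum-to-zero space
inverse_softmax <- function(p) {
    values <- log(p)
    values - mean(values)
}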

The reason you cannot define a threshold on proportions is that a change from 0.4 to 0.6 is much easier than a change from 0.9 to 0.99; in other words, proportions are not linear.
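To make the non-linearity concrete, here is a minimal sketch that compares the two changes mentioned above on the proportion scale and on the unconstrained (inverse-softmax) scale. The two-cell-type compositions are hypothetical numbers chosen only for illustration, and inverse_softmax is repeated from the definition above so that the snippet runs on its own.

```r
# Inverse softmax as defined earlier in this thread.
inverse_softmax <- function(p) {
    values <- log(p)
    values - mean(values)
}

# Hypothetical two-cell-type compositions (illustrative numbers only).
# Each vector is c(cell type A, cell type B) and sums to 1.
mid_before  <- c(0.40, 0.60)
mid_after   <- c(0.60, 0.40)
high_before <- c(0.90, 0.10)
high_after  <- c(0.99, 0.01)

# Change of cell type A on the proportion scale
prop_change_mid  <- mid_after[1]  - mid_before[1]
prop_change_high <- high_after[1] - high_before[1]

# Change of cell type A on the unconstrained scale
unc_change_mid  <- inverse_softmax(mid_after)[1]  - inverse_softmax(mid_before)[1]
unc_change_high <- inverse_softmax(high_after)[1] - inverse_softmax(high_before)[1]

print(round(c(mid = prop_change_mid, high = prop_change_high), 3))
print(round(c(mid = unc_change_mid,  high = unc_change_high), 3))
```

On the proportion scale the 0.4 to 0.6 change looks larger (0.20 versus 0.09), but on the unconstrained scale the 0.9 to 0.99 change is roughly three times larger (about 1.20 versus 0.41). A fixed threshold on raw proportions would therefore treat the two situations inconsistently.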

As per the TREAT method (https://rdrr.io/bioc/edgeR/man/glmTreat.html), the filtering strategy (testing first, then filtering on fold change) is not advisable. You should test against a threshold directly.

Even more importantly (I will add this to the method), you cannot set the threshold to 0, because the Bayesian FDR needs a null hypothesis (H0) with non-zero probability.
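A small sketch of why a zero threshold breaks down: with a continuous posterior, the probability that the effect is exactly 0 is zero, whereas an interval null such as [-0.2, 0.2] has a usable probability. The draws below are simulated with rnorm as a hypothetical stand-in for posterior draws from a fitted model, and prob_null is an illustrative helper, not an sccomp function.

```r
set.seed(1)

# Simulated posterior draws for one cell type's effect on the unconstrained
# scale (hypothetical stand-in for draws from a fitted model).
draws <- rnorm(4000, mean = 0.15, sd = 0.1)

# Posterior probability of the null region |effect| <= threshold
# (illustrative helper, not part of sccomp).
prob_null <- function(draws, threshold) mean(abs(draws) <= threshold)

p_null_default <- prob_null(draws, 0.2)  # interval null [-0.2, 0.2]
p_null_zero    <- prob_null(draws, 0)    # point null at exactly 0

print(c(threshold_0.2 = p_null_default, threshold_0 = p_null_zero))
```

With a threshold of 0, no draw falls in the null region, so the null probability is 0 for every cell type and there is nothing for an FDR to be computed from. With the 0.2 threshold, the same draws give a null probability well above zero.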

Commit: https://github.com/stemangiola/sccomp/commit/79829a5f10e19bc75a305a86cba2d85a46b1de13

Winbuntu commented 11 months ago

Thanks for the reply! Based on what you suggest, I will keep the default threshold of 0.2 and report the c_FDR together with the proportion of cells in each group. I have a few additional questions:

Since we are testing multiple cell types simultaneously, I think multiple-testing correction is needed. Is c_FDR already adjusted for multiple testing?
In general, modeling composition is my primary interest. Any advice on what formula I should pass to formula_variability? Should I use "~ 1", or the same formula as for formula_composition?

stemangiola commented 11 months ago

> Since we are testing multiple cell types simultaneously, I think multiple-testing correction is needed. Is c_FDR already adjusted for multiple testing?

FDR is the statistic you want. For such highly hierarchical and constrained models, multiple-testing correction is not needed (there is literature about this), as the noise is very well modelled.
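For intuition about how an FDR can be defined across all cell types at once, here is a sketch of one common Bayesian FDR construction: sort the per-cell-type posterior null probabilities and take their running mean. This is an assumption about the general technique, not a confirmed description of how sccomp computes c_FDR; the function name bayesian_fdr and the probabilities are made up for illustration.

```r
# One common Bayesian FDR construction (illustrative, not sccomp's code):
# for each cell type, the FDR is the mean posterior null probability over
# all cell types that are at least as significant as it.
bayesian_fdr <- function(p_null) {
    ord <- order(p_null)                     # most significant first
    fdr <- numeric(length(p_null))
    fdr[ord] <- cumsum(p_null[ord]) / seq_along(p_null)
    fdr
}

# Hypothetical posterior null probabilities for four cell types
p_null <- c(A = 0.01, B = 0.50, C = 0.03, D = 0.20)

fdr <- bayesian_fdr(p_null)
names(fdr) <- names(p_null)
print(fdr)
# A = 0.010, B = 0.185, C = 0.020, D = 0.080
```

Because each value is already an average over the whole set of cell types that would be called significant at that cut-off, the quantity is defined jointly across cell types rather than one test at a time.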

> In general, modeling composition is my primary interest. Any advice on what formula I should pass to formula_variability? Should I use "~ 1", or the same formula as for formula_composition?

Yes, you can leave ~ 1. It assumes the same variability for all.

MangiolaLaboratory / sccomp

Question about "test_composition_above_logit_fold_change" parameter in sccomp_glm #99