MangiolaLaboratory / sccomp

Testing differences in cell type proportions from single-cell data.
https://stemangiola.github.io/sccomp/

Question about "test_composition_above_logit_fold_change" parameter in sccomp_glm #99

Closed by Winbuntu 11 months ago

Winbuntu commented 11 months ago

Hi sccomp team,

I have a few questions about the "test_composition_above_logit_fold_change" parameter in sccomp_glm(), and I am wondering if I could get some input from you:

stemangiola commented 11 months ago

Hello @Winbuntu,

I should rename that to test_composition_above_inv_softmax_fold_change. It is a bit of a mouthful, but the inverse-softmax (unconstrained) scale is the space in which the parameter is estimated.

Here is the definition of softmax and its inverse:

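# Numerically stable log(sum(exp(x)))
logsumexp <- function (x) {
    y <- max(x)
    y + log(sum(exp(x - y)))
}

# Map an unconstrained, zero-sum vector to proportions that sum to 1
softmax <- function (x) {
    # Compare with a tolerance: an exact `sum(x) != 0` fails on floating-point rounding
    if(abs(sum(x)) > 1e-8) stop("The unconstrained vector should sum to zero, to keep the same degrees of freedom (i.e. length(x)-1) as the proportion vector, which sums to 1.")
    exp(x - logsumexp(x))
}

# Map proportions back to the unconstrained, zero-sum scale
inverse_softmax <- function(p) {
    values <- log(p)
    values - mean(values)
}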
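As a quick check of how these two functions relate, here is a minimal round-trip sketch. The proportions in `p` are made up for illustration, and the small tolerance in the zero-sum check is my assumption, added so that floating-point rounding does not trigger the error.

```r
# Definitions repeated so that the snippet runs on its own.
logsumexp <- function(x) {
  y <- max(x)
  y + log(sum(exp(x - y)))
}

softmax <- function(x) {
  # Tolerance is an assumption of this sketch, not taken from the thread.
  if (abs(sum(x)) > 1e-8) stop("The unconstrained vector should sum to zero.")
  exp(x - logsumexp(x))
}

inverse_softmax <- function(p) {
  values <- log(p)
  values - mean(values)
}

p <- c(0.5, 0.3, 0.2)        # hypothetical cell-type proportions, sum to 1
x <- inverse_softmax(p)      # unconstrained values, centred so they sum to 0
p_back <- softmax(x)         # back on the proportion scale

sum(x)                       # zero up to floating-point error
all.equal(p_back, p)         # TRUE
```

The threshold discussed in this thread lives on the scale of `x`, not on the scale of `p`.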

The reason you cannot define a threshold on proportions is that a change from 0.4 to 0.6 is much easier than a change from 0.9 to 0.99. In other words, proportions are not linear.
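To make the non-linearity concrete, the two changes mentioned above can be compared on the log-odds scale. This is only an illustration using base R's `qlogis()` (the logit function) on a single proportion; it is not the multi-group inverse softmax used by the model.

```r
# Shift in log-odds for a change in the middle of the range (0.4 -> 0.6)
small_shift <- qlogis(0.6) - qlogis(0.4)

# Shift in log-odds for a change near the boundary (0.9 -> 0.99)
large_shift <- qlogis(0.99) - qlogis(0.9)

round(small_shift, 2)  # 0.81
round(large_shift, 2)  # 2.4
```

The second change is less than half the size of the first on the proportion scale (0.09 versus 0.2), yet about three times larger on the log-odds scale, which is why a single cut-off on raw proportions would not mean the same thing everywhere.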

As per the TREAT method (https://rdrr.io/bioc/edgeR/man/glmTreat.html), the filtering strategy is not advisable. You should test against a threshold instead.

Even more importantly (I will add this to the method), you cannot set the threshold to 0, because the Bayesian FDR needs the null hypothesis H0 to have non-zero probability.
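One way to see why a zero threshold breaks down is the following sketch. It is not sccomp's implementation: the posterior draws are simulated, the group names are hypothetical, and the FDR is computed with a commonly used recipe (rank by the posterior probability of H0, then take the running mean), which I am assuming here purely for illustration.

```r
set.seed(1)
threshold <- 0.2  # on the unconstrained (inverse-softmax) scale

# Hypothetical posterior draws of an effect for three cell groups (columns).
draws <- cbind(
  strong = rnorm(4000, mean = 1.00, sd = 0.2),
  weak   = rnorm(4000, mean = 0.25, sd = 0.2),
  none   = rnorm(4000, mean = 0.00, sd = 0.2)
)

# Posterior probability of H0: the effect lies within [-threshold, threshold].
prob_h0 <- colMeans(abs(draws) <= threshold)

# Running mean of the sorted H0 probabilities, mapped back to the groups.
ord <- order(prob_h0)
fdr <- numeric(length(prob_h0))
fdr[ord] <- cumsum(prob_h0[ord]) / seq_along(ord)
names(fdr) <- names(prob_h0)

# With a threshold of 0, H0 shrinks to a single point, and continuous
# draws never land exactly on it: every H0 probability is 0.
prob_h0_zero <- colMeans(abs(draws) <= 0)
```

In this sketch `prob_h0_zero` is zero for every group, including `none`, so any FDR built from it would be zero everywhere and could not separate real effects from noise. A positive threshold gives H0 an interval, and therefore a non-zero probability.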

Commit: https://github.com/stemangiola/sccomp/commit/79829a5f10e19bc75a305a86cba2d85a46b1de13

Winbuntu commented 11 months ago

Thanks for the reply! Based on what you suggest, I will set the threshold to 0.2, the default, and report the c_FDR together with the proportion of cells in each group. I have a few additional questions:

stemangiola commented 11 months ago
  • Since we are testing multiple cell types simultaneously, I think multiple-testing correction is needed. Has c_FDR already been adjusted for multiple testing?

FDR is the statistic you want. For such highly hierarchical and constrained models, multiple-testing correction is not needed (there is literature about this), as the noise is very well modelled.

  • In general, modelling composition is my primary interest. Any input on what formula I should put into formula_variability? Should I use "~ 1", or the same formula as I have for formula_composition?

Yes, you can leave ~ 1. It assumes the same variability for all.
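For readers unfamiliar with R formulas, here is a small sketch of what `~ 1` expands to compared with a formula that includes a covariate. It uses only base R's `model.matrix()`; the data frame and the column name `type` are hypothetical, and how sccomp consumes `formula_variability` internally is not shown in this thread.

```r
# Hypothetical sample-level metadata.
samples <- data.frame(
  sample = paste0("s", 1:4),
  type   = factor(c("healthy", "healthy", "cancer", "cancer"))
)

# `~ 1` expands to an intercept only: one shared term for every sample.
design_shared <- model.matrix(~ 1, data = samples)

# `~ type` adds a second column that distinguishes the two conditions.
design_by_type <- model.matrix(~ type, data = samples)

ncol(design_shared)   # 1
ncol(design_by_type)  # 2
```

With an intercept-only design there is no term that could let the modelled quantity differ between conditions, which matches the "same variability for all" reading of `~ 1`.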