MarioniLab / scran

Clone of the Bioconductor repository for the scran package.
https://bioconductor.org/packages/devel/bioc/html/scran.html
40 stars 22 forks source link

Returning in findMarkers() the t-statistics (or other pairwise statistics calculated) #57

Closed lcolladotor closed 4 years ago

lcolladotor commented 4 years ago

Hi Aaron,

For a project with @mattntran, we want to run findMarkers() and retain all the t-statistics it calculates. Matt currently has a line of code like this:

findMarkers(sce.hpc, groups=sce.hpc$contrast,
    assay.type="logcounts", design=mod, test="t",
    direction="up", pval.type="all", full.stats=TRUE
)

Current code

I see that pairwiseTTests() at pairwiseTTests.R#L271 computes the information it needs for calculating the t-statistics using .get_t_test_stats() at pairwiseTTests.R#L303-L324. Then that information is used for computing the t-statistics and the p-values at pairwiseTTests.R#L276 using .run_t_test() pairwiseTTests.R#L431-L466 that itself calculates the t-statistic (cur.t) depending on some input options. For example, if thresh.lfc is 0, then it does it by dividing the log fold change by the square root of the error pairwiseTTests.R#L437. Currently, .run_t_test() doesn't return the cur.t value at the end at pairwiseTTests.R#L465.

Option 1: Currently compute t-statistic

One option currently for us, would be to use the p-value to compute the t-statistic.

Since we know that thresh.lfc option we are using and because we used direction = "up" (so pairwiseTTests.R#L447-L454 is not run), we could use stats::qt() to get the t-statistic from the p-value. We would need to run the internal function .get_t_test_stats() pairwiseTTests.R#L303-L324 to obtain the degrees of freedom.

Or maybe we missed a function in scran() that does this already.

Option 2: edit scran to support returning the statistics

Another option would be to edit scran such that .run_t_test() returns the cur.t at pairwiseTTests.R#L465, then also return it at STATFUN at pairwiseTTests.R#L285-L294 then potentially also edit .pairwise_blocked_template() utils_markers.R#L38-L51 (you know what would need to be edited :P).

But then, maybe you'd also want to return the corresponding statistic for the Wilcoxon tests or other tests supported by findMarkers() when full.stats = TRUE.

Let us know how you think it would be best to proceed. For example, I can try submitting a PR about the t-stats.

Best, Leo

LTLA commented 4 years ago

I would not be opposed to pairwiseTTests() emitting the t-statistic in its various DataFrames. But once it goes into combineMarkers(), there is no single statistic anymore; it all gets aggregated together, depending on the choice of aggregation method specified by pval.type=.

LTLA commented 4 years ago

I forgot that, even within pairwiseTTests(), p-values are aggregated across blocks, so again there isn't necessarily a single t-statistic in those cases. I don't particularly fancy having a function that sometimes reports t-statistics and sometimes doesn't; that behavior seems too difficult to me.

Why do you need the t-statistics anyway? You can get very close with the standardized log-fold change; for any use of pairwiseTTests() with design=, Cohen's D only differs from a t-statistic by a constant scaling factor, so should be similarly applicable in any situation that I can imagine.

lcolladotor commented 4 years ago

Hi Aaron,

Ahh yeah, I forgot about how the p-values are combined. Just to expand a bit, @mattntran wants the pairwise t-stats to then compare them across datasets like we do at http://research.libd.org/spatialLIBD/reference/layer_stat_cor.html in order to make plots like in http://research.libd.org/spatialLIBD/reference/layer_stat_cor_plot.html.

However, instead of computing these stats at the pseudo-bulk level like we did, he wants to compute them at the single nucleus level for this snRNA-seq data. That's why he is using scran::findMarkers().

Regarding the aggregation methods controlled by pval.type, maybe when scran chooses the min or max p-value, scran could report the corresponding t-statistic for that test. As for t-statistics across blocks, hm..., maybe it would need to be a data frame with 1 column per block. I recognize that this sounds a bit complicated for supporting all the uses cases (which again, I wasn't thinking about before, oops). So it's quite a bit of work.

Given the complexity of expanding this for the general use case, maybe Matt and I just need to hack our way through the code to get the pairwise t-stats and close this issue. Or like you said Aaron, use the standarized log-fold change (Matt, is there a reason I'm missing as to why the correlation works best using the t-stats? Hm... I guess that the constant scaling factor (variance) can be really different between data sets.)

Best, Leo

lcolladotor commented 4 years ago

Hi Aaron,

Looking in more detail, maybe we don't need to hack anything and simply use std.lfc = TRUE (@mattntran pointed out this argument to me and I see now documented at https://github.com/MarioniLab/scran/blob/cbc535d433d9168707e7db025f16d09b6ae21852/R/pairwiseTTests.R#L73-L76). I see at that when lfc is 0 (the default), the t stat is calculated using https://github.com/MarioniLab/scran/blob/cbc535d433d9168707e7db025f16d09b6ae21852/R/pairwiseTTests.R#L437

I see that .fit_lm_internal() is used when the design is specified https://github.com/MarioniLab/scran/blob/cbc535d433d9168707e7db025f16d09b6ae21852/R/pairwiseTTests.R#L184-L187 (earlier I looked at .test_block_internal()). When std.lfc = TRUE, I see at https://github.com/MarioniLab/scran/blob/cbc535d433d9168707e7db025f16d09b6ae21852/R/pairwiseTTests.R#L405-L408 that Cohen's D is calculated which is similar to https://github.com/MarioniLab/scran/blob/cbc535d433d9168707e7db025f16d09b6ae21852/R/pairwiseTTests.R#L437. I just re-read https://en.wikipedia.org/wiki/Effect_size#Cohen's_d and I see that d = t / sqrt(N).

Earlier I didn't realize this and was thinking that regular logFC are not scaled by their variance and hence comparing them across datasets is hard. However, with Cohen's d we could do that.

Am I on the right track? As in use std.lfc = TRUE then either just correlate the standardized LFC (Cohen's d) across datasets and/or multiply Cohen's d by sqrt(N) without having to hack anything.

Best, Leo

LTLA commented 4 years ago

Yes, that sounds right. I would say that Cohen's D is even better for comparison across datasets, as it is not affected by whether one dataset has more or fewer cells than the other.

lcolladotor commented 4 years ago

Ahh, nice, thanks Aaron!

mattntran commented 4 years ago

Thanks for the feedback and insight, Aaron! I've started just running iterations with and without std.lfc = TRUE so we can have both observations to reference, and since the test runs really quickly 👍