Support for passing hypothesis to doLRT in summary() as in lrTest()

This is a (hopefully) minor feature request.

I am using the development version of the package. lrTest supports passing numerous different types of objects for the argument hypothesis, according to the document: "hypothesis can be one of a character giving complete factors or terms to be dropped from the model, CoefficientHypothesis giving names of coefficients to be dropped, Hypothesis giving contrasts using the symbolically, or a contrast matrix." However, when called via summary(zlmfit, doLRT), the doLRT argument can only be a character vector of coefficients to drop in the reduced model. It would be very helpful to also support the multiple argument types for doLRT in summary as those for hypothesis in lrTest.

My use case is as follows: I was exploring a series of somewhat complex nested design models, and I was not confident that I got the design matrix correct every time. After considering the experimental designs and questions of interests, I decided to instead create a new factor for combinations of all levels of the factors under consideration, provide an intercept-free model including only this new factor to zlm, and then obtain log fold-change values for a customized comparison via getLogFC(zlmfit, contrast0, contrast1) specifying both the contrast0 and contrast1 arguments. I was able to include such customized logFC in summary via summary(zlmfit, logFC=getLogFC(zlmfit, contrast0, contrast1)), and I also wanted to perform LR test to get P values exactly corresponding to the customized comparison. But with the current doLRT argument only accepting a character vector of coefficients I was not able to specify the test I needed.

Thanks for your help.

While you are welcome to submit a PR, it's not going to be a priority for me to extend summary in this way.

There's a coupling between the lrTest and the log fold changes that is complicated to get correct here,
I kind of hate the design of summary, so don't want to invest time in extending it. At some point, I envision rewriting it to resemble the DESeq2::results method, which puts everything in a "wider" format, maybe using nested tibbles to handle the discrete/continous components.

Workaround is to just call lrTest with your coefficients and combine by hand.

@amcdavid Thanks for the quick reply!

I do realize that this can be complicated as I further explored the manual combine approach. I would like to ask a little more in this line -- just want to make sure I get things correct when I implement my own workaround. I noticed that logFC in MAST is kind of unusual in that it requires two contrasts and will return the comparison of contrast1 to contrast0, while for lrTest I would pass something that corresponds to contrast1-contrast0. Generally in a linear model, such as in limma, we can also obtain a desired logFC with just a single contrast. I believe for MAST it's different due to the fact that logFC is defined as u(contrast1)v(contrast1) - u(contrast0)v(contrast0), however, it seems that this formula doesn't make sense for cases e.g. the desired logFC is a mean of multiple logFC's from a series of replicated pairwise comparisons. I wonder whether an alternative definition would make more sense for these cases, e.g. for intercept-free models where raw coefficients correspond to group means, u(coefficients)v(coefficients) %*% contrast for a single contrast (which corresponds to what I would pass to lrTest). Sorry I have not provided a concrete example, hopefully the point is clear. Could you kindly comment further wrt this?

Thanks again for providing and wonderfully maintaining this useful package. If I will ever manage to get a confident understanding of the inner workings of the package I'll be glad to try a PR:)

However, it seems that this formula doesn't make sense for cases e.g. the desired logFC is a mean of multiple logFC's from a series of replicated pairwise comparisons.

I probably need an example here to understand what you are referring to.

I wonder whether an alternative definition would make more sense for these cases, e.g. for intercept-free models where raw coefficients correspond to group means, u(coefficients)v(coefficients) %*% contrast for a single contrast (which corresponds to what I would pass to lrTest).

What do you mean by coefficients vs contrast? You can select a coefficient with a contrast that is just an indicator vector.

I'll try to explain with a toy example. For a "real" single-cell RNA-seq dataset there will be of course a lot more samples (cells).

First consider a simple one-way design of a factor named Group with the two levels "Control" and "Treated", the aim is to compare the "Treated" group to the "Control" group.

 #>     Group
 #> 1 Control
 #> 2 Control
 #> 3 Control
 #> 4 Treated
 #> 5 Treated
 #> 6 Treated

model.matrix with formula ~Group would give:

#>   (Intercept) GroupTreated
#> 1           1            0
#> 2           1            0
#> 3           1            0
#> 4           1            1
#> 5           1            1
#> 6           1            1

And in MAST::logFC contrast0 would be the following, corresponding to the "Control" group:

#>             (Intercept) GroupTreated
#> (Intercept)           1            0

contrast1 would be the following, corresponding to the "Treated" group:

#>              (Intercept) GroupTreated
#> GroupTreated           1            1

u(contrast0)v(contrast0) would be interpreted as the expected expression level (on the log scale) of the "Control" group, similarly u(contrast1)v(contrast1) would be that for the "Treated" group, thus u(contrast1)v(contrast1)-u(contrast0)v(contrast0) would correspond to the logFC between "Treated" and "Control". For this case everything is fine.

But then consider, a repeated measure design involving paired "Baseline" (i.e. control) and "Treated" groups of the same individual (i.e. paired data for each individual). The aim is to compare the "Treated" group to the "Baseline" group across the individuals. One may be interested in a effect size measure something like ( (Treated1 - Baseline1) + (Treated2 - Baseline2) + (Treated3 - Baseline3) )/3, in general this would be a (possibly weighted) average of logFC values for paired comparisons.

#>   Individual    Group
#> 1          1 Baseline
#> 2          1  Treated
#> 3          2 Baseline
#> 4          2  Treated
#> 5          3 Baseline
#> 6          3  Treated

The issue with this case is that for MAST::logFC we have to specify two contrasts corresponding to two groups to be compared. In the above case one could specify contrast1 that corresponds to (T1 + T2 + T3)/3 (T for Treated), and contrast0 that corresponds to (B1 + B2 + B3)/3 (B for Baseline). Normally this should not be an issue, as (T1 + T2 + T3)/3 - (B1 + B2 + B3)/3 would be equivalent to what we want. But in MAST, there are the continuous and discrete components that are being multiplied, and the results should be not equivalent in general. In other words, u((T1+T2+T3)/3)v((T1+T2+T3)/3) - u((B1+B2+B3)/3)v((B1+B2+B3)/3) is not the same as ( (u(T1)v(T1) - u(B1)v(B1)) + (u(T2)v(T2) - u(B2)v(B2)) + (u(T3)v(T3) - u(B3)v(B3)) )/3. Naively, it appears to me that the latter has a more meaningful interpretation considering the meanings of u(.) and v(.) and u(.)v(.)

Given this context, when I said u(coefficients)v(coefficients) %*% contrast, I actually meant ( coefC * 1/(1+exp(-coefD)) ) %*% contrast (sorry my previous notation was wrong). coefC and coefD would be the coefficients obtained when using an intercept-free model, such that they correspond to the estimated value for each "actually-defined" group (T1, T2, T3, B1, B2, B3), and contrast in the above example would be something like

 T1   T2   T3    B1    B2    B3
1/3  1/3  1/3  -1/3  -1/3  -1/3

Hopefully I am not missing anything and this makes sense. Thanks again for your time.

In other words, u((T1+T2+T3)/3)v((T1+T2+T3)/3) - u((B1+B2+B3)/3)v((B1+B2+B3)/3) is not the same as ( (u(T1)v(T1) - u(B1)v(B1)) + (u(T2)v(T2) - u(B2)v(B2)) + (u(T3)v(T3) - u(B3)v(B3)) )/3

Yes -- that is for better or worse intrinsic to the model. The former quantity (log fold change of the average effect) is going to be different than the average of the log fold change. A workaround might be to stratify into 3 groups, calculate the log-fold change in each group, then average those -- you could propagate error by just summing the variances and dividing by the number of summands squared (eg 9 in your example) because the estimates will be independent (in the statistical sense) from each other.

From a "multifactorial ANOVA" standpoint, if you are fitting an interaction model (between individual and treatment), does it make sense to interpret an average effect? Some people say not.

@amcdavid Thanks for the suggestions. I can see that it's tricky to accommodate for a general case in the computation of logFC. I guess this issue can now be closed.

From a "multifactorial ANOVA" standpoint, if you are fitting an interaction model (between individual and treatment), does it make sense to interpret an average effect? Some people say not.

Yep sure I agree that if the interaction effect is non-trivial then averaging the main effect is probably not meaningful or informative. This is an independent data analysis consideration though, but points well taken and thanks for mentioning this.

RGLab / MAST

Support for passing hypothesis to doLRT in summary() as in lrTest() #171