GabrielHoffman / variancePartition

Quantify and interpret divers of variation in multilevel gene expression experiments
http://gabrielhoffman.github.io/variancePartition/
60 stars 14 forks source link

Enable fitting of non-full rank formula #95

Closed jmichuda closed 5 months ago

jmichuda commented 9 months ago
Here's some example metadata that demonstrates the issue I'm encountering. subject dose timepoint plate
control control control plate 1
control control control plate 2
indiv1 dose 1 day 1 plate 1
indiv1 dose 1 day 2 plate 2
indiv2 dose 2 day 1 plate 1
indiv2 dose 2 day 2 plate 2

Basically I'd like to build a formula of ~ dose:timepoint + (1 | subject) + (1 | plate)

The control samples have no dose or timepoint information, which triggers the following error when running dream():

Error: the fixed-effects model matrix is column rank deficient (rank(X) = 12 < 13 = p);
the fixed effects will be jointly unidentifiable

I'm not interested in modeling the fixed effect for the control samples, I just want them to be used to help estimate the random effect of plate. If variancePartition::dream() took a design matrix as input, we could avoid this by removing repetitive columns. But it seems like it only supports formula inputs. Any advice on how to get around this issue? Ideally would like to avoid 2-part models.

I've also tried modifying the lmer control to just drop columns and only pass in the contrasts that I care about (contrasts made using makeContrastsDream) but it appears this functionality doesn't exist:

fitmm <- variancePartition::dream(vobjDream, form, meta,  L = contrasts_matrix, quiet = TRUE, control = 
  lme4::lmerControl(
    calc.derivs = FALSE,
    check.conv.singular = lme4::.makeCC("ignore", tol = 1e-4),
    check.rankX = "message+drop.cols"
    )
  )

Error in variancePartition::dream(vobjDream, form, diffex_meta, L = contrasts_matrix,  : 
  Names of entries in L must match fixed effects

And also tried using an indicator value in the formula:

~ (1 - control_indicator) * (dose:timepoint) + (1 | subject) + (1 | plate)

But then lme4 fails to identify the fixed effect in the model.

The core issue is that variancePartition's dream module creates the full contrast matrix here, which ultimately relies on lme4::no_bars() to parse the formula and the metadata for the fixed effects model matrix. I could modify the .getFixefNames function to exclude fixed effects from the model matrix that are colinear or duplicates of other variables.

GabrielHoffman commented 9 months ago

Was this resolved? I’m traveling until Monday, so I likely won’t be able to answer until then

jmichuda commented 9 months ago

Hi Gabriel, Thanks for your response. I haven't quite figured it out yet. Essentially what I'm trying to do is something analogous to the following workflow in base limma/voom:

# create design matrix with no linear combinations
form <- ~ dose_factor + day_factor + day_factor:dose_factor
design <- model.matrix(form, diffex_meta)
linCombos <- caret::findLinearCombos(design)
design <- design[, -unique(unlist(linCombos$linearCombos))]

fit <- voomLmFit(gExpr, design)
fit <- eBayes(fit)

I'm able to remove columns with linear combinations from the model matrix prior to fitting the model. Unfortunately it doesn't seem like variancePartition::dream() supports a similar syntax of inputting a design matrix.

GabrielHoffman commented 9 months ago

Let's consider the design matrix from your data:

dose = c("Control", "Control", "D1", "D1", "D2", "D2")
timePoint = c("Control", "Control", "Day1", "Day2", "Day1", "Day2")

df = data.frame(dose, timePoint)

model.matrix(~ dose + timePoint, df)
#>   (Intercept) doseD1 doseD2 timePointDay1 timePointDay2
#> 1           1      0      0             0             0
#> 2           1      0      0             0             0
#> 3           1      1      0             1             0
#> 4           1      1      0             0             1
#> 5           1      0      1             1             0
#> 6           1      0      1             0             1

This is not full rank, so these parameters can't be estimated together in the model.

In my experience, manually dropping columns from the design matrix is problematic because the formula is no longer interpretable after you drop columns. Instead of addressing this at an engineering level, we can take this result to mean that the regression model is overly complex and estimates too many parameters for the data we have. It means that the effect of dose and time can't be decoupled in this study design.

Instead, you should recode your meta-data matrix and create a design matrix that is full rank.