easystats / performance

:muscle: Models' quality and performance metrics (R2, ICC, LOO, AIC, BF, ...)
https://easystats.github.io/performance/
GNU General Public License v3.0

Hugely different R2 values for slightly different models #332

Open wmacnair opened 3 years ago

wmacnair commented 3 years ago

I've been fitting some mixed models to genomic (transcriptomic) count data, and wanted to assess model fit. I've tried out a few models and a few options for R2, however they give wildly different R2 values.

I have a couple of questions, which I've set out below after the MWE.

Thanks! Will

MWE

The data has 8 different celltypes, across 196 individuals (for almost all individuals we have all 8 possible values). counts corresponds to how many times we observe a given gene, libsize is the offset for this observation, and logcpm = log( counts / libsize * 1e6 + 1) i.e. log values slightly pushed away from 0 and normalized to the expected range.
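For concreteness, the logcpm column could be derived along these lines (a minimal sketch; it assumes data_dt is a data.frame with counts and libsize columns as described above):

## sketch: derive logcpm from counts and libsize (assumes columns as described)
data_dt$logcpm = log(data_dt$counts / data_dt$libsize * 1e6 + 1)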

Plotting these log values indicates that we should expect both celltype and individual to be relevant:

[Figures: logcpm by celltype; logcpm by individual]

I fit 4 models and calculate R2 for each:

## model 1: negative binomial with random effect
fit_glmm    = glmmTMB::glmmTMB(counts ~ celltype + (1|individual),
  offset = log(libsize), data = data_dt, family = glmmTMB::nbinom2)
performance::r2(fit_glmm)
# # R2 for Mixed Models
#   Conditional R2: 0.084
#      Marginal R2: 0.054
# 4: mu of 0.0 is too close to zero, estimate of random effect variances may be
#   unreliable.

## model 2: negative binomial without random effect
fit_glm     = glm(counts ~ celltype,
  offset = log(libsize), data = data_dt, family = glmmTMB::nbinom2)
performance::r2(fit_glm)
# # R2 for Generalized Linear Regression
#   Nagelkerke's R2: 0.738

## model 3: log CPM with random effect
fit_lmer    = lme4::lmer(logcpm ~ celltype + (1 | individual), data = data_dt)
performance::r2(fit_lmer)
# # R2 for Mixed Models
#   Conditional R2: 0.444
#      Marginal R2: 0.215

## model 4: log CPM without random effect
fit_lm      = lm(logcpm ~ celltype, data = data_dt)
performance::r2(fit_lm)
# # R2 for Linear Regression
#        R2: 0.214
#   adj. R2: 0.211

The main thing I don't understand is why there is such a large difference between models 1 and 2. An R2 of 8% just doesn't seem plausible when the distributions are so tight (while an R2 of 74% for model 2 seems about right). The coefficient estimates are pretty similar, so it seems more likely that the difference comes from how the R2 is calculated, rather than from the fits being wildly different.

Does the warning for model 1 indicate that there are issues here? If so, what could they be? To me it doesn't seem like an unreasonably complex model for the data...

I've half-read the Nakagawa paper, so could it be that for model 1 the R2 is based on log values, while for model 2 it is based on the deviance? Could that explain the huge difference?

(The differences between the GLM and log-linear models seem to make sense, as with many zeros present the GLM should better capture the data.)
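One way to make the comparison explicit is to request specific estimators by name rather than whatever the generic r2() picks; a sketch, reusing the fits above:

## sketch: ask for specific estimators rather than the r2() default
performance::r2_nakagawa(fit_glmm)    # latent-scale, variance-decomposition R2
performance::r2_nagelkerke(fit_glm)   # likelihood/deviance-based pseudo-R2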

wmacnair commented 3 years ago

One thought is whether it could be related to offset terms. In insight:::.variance_distributional, a warning is thrown if exp(null_fixef) is too small. This makes sense if there is no offset, but it seems like it's missing something if it's also thrown without checking any offsets. For example in my case, the intercept term is tiny, but that's because the offsets are huge.
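As a concrete check of that (a sketch, reusing fit_glmm from above; whether the null model carries the offset through is exactly the question being raised here):

## sketch: look at the null-model intercept that the warning is about
null_fit = insight::null_model(fit_glmm)
exp(glmmTMB::fixef(null_fit)$cond[["(Intercept)"]])  # tiny, because the log(libsize) offsets are large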

This made me wonder whether the R2 defined for quasi-Poisson / negative-binomial models needs to be tweaked to handle offset terms. The Nakagawa R2 is defined in eqn 3.3 here as:

R2_QP = sigma_f^2 / (sigma_f^2 + sigma_alpha^2 + ln(1 + omega / lambda))

In the standard no-offset model, var(beta_0) = 0, but with offsets it feels like this component of the variance is missing. Could we also need something like a sigma_offset^2 term in the numerator?
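To make the pieces concrete, here is a rough illustrative sketch of how eqn 3.3 assembles (not how insight computes it internally; sigma_f2, sigma_a2, omega and lambda would come from the fitted model, and the offset question above is precisely about what lambda should be):

## illustrative only: quasi-Poisson / negative-binomial Nakagawa R2, eqn 3.3
r2_qp = function(sigma_f2, sigma_a2, omega, lambda) {
  sigma_d2 = log(1 + omega / lambda)  # distribution-specific variance on the latent (log) scale
  sigma_f2 / (sigma_f2 + sigma_a2 + sigma_d2)
}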

bwiernik commented 3 years ago

A couple of thoughts:

Comparing pseudo-R2s

In a linear model, R2 has several interpretations:

  1. Proportion of variance accounted for
  2. Standardized prediction accuracy (reduction in mean squared error)
  3. Relative increase in likelihood over null model
  4. Correlation of fitted line with response

In a generalized linear model, no single statistic has all of these meanings at once. It's not always easy to compare a pseudo-R2 with the R2 from a normal model, e.g.:

  1. Not all pseudo-R2s range from 0 to 1
  2. Typical/reasonable values differ from normal linear model R2 (e.g., likelihood-based pseudo-R2s don’t tend to get that high…)

Here is a nice FAQ on various pseudo-R2 values.

Here is a comparison of various pseudo-R2 estimators for binomial/logistic models, organized by which property of the normal linear R2 each one mirrors:

| Normal R2 property | Pseudo-R2 | Equation | R function |
| --- | --- | --- | --- |
| Relative increase in likelihood over null model | Nagelkerke's R2 | R2 = [1 − (L_int / L_full)^(2/N)] / [1 − (L_int)^(2/N)] | performance::r2_nagelkerke() |
| Proportion of variance accounted for | McKelvey & Zavoina's R2 | R2 = var(ŷ) / [var(ŷ) + var(dist)] | performance::r2_mckelvey() |
| Standardized prediction accuracy | Tjur's R2 | R2 = mean(p̂ when y = 1) − mean(p̂ when y = 0) | performance::r2_tjur() |
| Correlation of fitted line with response | Somers' D | D = rank correlation of (y, p̂) | performance::r2_somers() |

The generic performance::r2() function will usually select the "best" R2 to report for a given type of model. I personally recommend Tjur's R2 and/or Somers' D for logistic models.
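As a quick usage sketch (mtcars is just a stand-in binary outcome, not related to the data in this issue):

## sketch: the four estimators on a simple logistic model
m = glm(am ~ wt + hp, data = mtcars, family = binomial)
performance::r2_nagelkerke(m)
performance::r2_mckelvey(m)
performance::r2_tjur(m)
performance::r2_somers(m)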

For mixed models, the Nakagawa R2 values are most similar to McKelvey & Zavoina's R2. Personally, I don't think these are particularly meaningful: the distribution variance has really different meanings across families, and in basically any non-Normal family it doesn't meaningfully indicate "model fit" in any real sense (in my opinion).

Model choice

I wouldn't recommend the logcpm model. It's a kludgy solution compared to modeling the counts with a more appropriate distribution (such as Poisson and friends). As you note, with the many 0s, the residuals in the logcpm model aren't close to normally distributed.

The reason you are getting the warning is that the estimated mean of the null distribution (and thus the variance of the null distribution) is zero. This is because of the huge offset you are applying. In the log-normal model, you effectively have an offset of log(libsize/1e6) = log(libsize) - log(1e6). When we apply the same offset to the negative binomial model, there is no warning and the R2 values look reasonable:

fit_glmm2 = glmmTMB::glmmTMB(counts ~ celltype + (1|individual),
  offset = log(libsize) - log(1e6), ziformula = ~ 1 + celltype,
  data = data_dt, family = glmmTMB::nbinom2)
performance::r2(fit_glmm2)
# # R2 for Mixed Models
# 
#   Conditional R2: 0.127
#      Marginal R2: 0.081

In the figure you give, the cell types aren't very different from each other, so I would expect a modest pseudo-R2 in terms of variance accounted for. The values for the normal linear models seem oddly high based on the figure.

wmacnair commented 3 years ago

Hi @bwiernik, thanks very much for the comments and resources.

Regarding model choice, something like a mixed negative binomial model is clearly the right choice for this kind of data; I just fit the logcpm models to give more interpretable comparators for the pseudo R2 methods.

I agree that a variance-based model doesn't make much sense for something counts-based, especially when the counts are small. (I'm also not a big fan of an R2 measure whose value changes so radically with a trivial reparameterization of the model...) Although for genes with larger counts, the GLMM and log-normal models should be closer, so should make more sense, right?

Conceptually, I like the relative likelihood approaches. However, it doesn't seem like these are an option for non-logistic GLMMs, at least at the moment. Is that correct? Is there no equivalent to Nagelkerke's R2 where you calculate it with both L_full and something like L_fixed_only?
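To illustrate what I mean, here is a hand-rolled sketch of that idea, where fit_fixed_only is a hypothetical refit of model 1 without the random effect (I have no idea whether this has sensible properties):

## sketch only: Nagelkerke-style ratio using the fixed-effects-only fit as the reference
ll_full  = as.numeric(logLik(fit_glmm))
ll_fixed = as.numeric(logLik(fit_fixed_only))  # hypothetical fit without (1 | individual)
n = nobs(fit_glmm)
(1 - exp(2 * (ll_fixed - ll_full) / n)) / (1 - exp(2 * ll_fixed / n))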

The broader context is that I am fitting this model to all of our genes (i.e. the design matrix remains identical, and we fit each gene's different set of counts). The reason the Nakagawa R2 is appealing is that it allows us to see, broadly, which genes are highly determined by celltype, and which vary a lot across individuals. Do you have any thoughts on what the best approach to do something like this would be? At the moment I'm thinking of using the Nakagawa results, but adding a warning that these may not be so trustworthy for genes with smaller counts.
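A sketch of what I have in mind (purely illustrative; gene_list and counts_for() are hypothetical placeholders for however the per-gene counts are pulled out):

## sketch: identical design per gene, collect Nakagawa R2s across genes
r2_by_gene = lapply(gene_list, function(g) {
  dt = data_dt
  dt$counts = counts_for(g)  # hypothetical accessor for gene g's counts
  fit = glmmTMB::glmmTMB(counts ~ celltype + (1 | individual),
    offset = log(libsize) - log(1e6), data = dt, family = glmmTMB::nbinom2)
  performance::r2_nakagawa(fit)
})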

Many thanks Will

wmacnair commented 3 years ago

Also a brief note on your zero-inflated model - you get R2 values of around 13%:

fit_glmm2 = glmmTMB::glmmTMB(counts ~ celltype + (1|individual),
  offset = log(libsize) - log(1e6), ziformula = ~ 1 + celltype, 
  data = data_dt, family = glmmTMB::nbinom2)
performance::r2(fit_glmm2)
# # R2 for Mixed Models
# 
#   Conditional R2: 0.127
#   Marginal R2: 0.081

However, if I fit a model with your suggested offset adjustment, but without the zero inflation, I get much higher R2 values:

fit_glmm3    = glmmTMB::glmmTMB(counts ~ celltype + (1|individual), 
  offset = log(libsize) - log(1e6), 
  data = data_dt, family = glmmTMB::nbinom2)
performance::r2(fit_glmm3)
# # R2 for Mixed Models
#   Conditional R2: 0.670
#      Marginal R2: 0.427

This seems odd - surely a more complex model should better explain the data?

bwiernik commented 3 years ago

I had forgotten to remove the zero-inflation bit. Thanks for checking. With respect to the values, the "more complex model" heuristic you are using just doesn't work outside of normal linear models. I am not exactly sure offhand how the Nakagawa R2 functions are written to handle mixture models with zero-inflation or dispersion parameters. In particular, the distributional variance is going to be extremely tricky with both binomial and Poisson components in the model; I don't know if we handle that well here. But I don't think those values are going to be meaningful either way. Like I said above, the distributional variance is determined by which likelihood family you choose, and it doesn't really work for the type of interpretation you're wanting to do.

These are fairly trivial reparameterizations as far as the model coefficients go, but the underlying likelihood functions used to fit them are quite different and, especially, their statements about the variance of the response variable are extremely different (a normal distribution has a free variance; a negative binomial model has a variance that is a function of the mean). Statistics that rely on those variance estimates will be very dependent on the model form.

With that in mind, I have 3 suggestions:

You can compute a likelihood-based R2 for any model by asking for it specifically with r2_nagelkerke(). The generic r2() function just provides a usually-good default choice.

wmacnair commented 3 years ago

I'm not sure I get why a more complex model wouldn't give a better fit here. The set of models without zero-inflation is a strict subset of the models with zero-inflation, no? Can the change in the model mean that the best option without zero-inflation no longer has the highest likelihood? (In any case, I wasn't intending to use a zero-inflated model.)

I get that in the negative binomial, the variance is mu + mu^2/lambda, so the value of mu has a huge impact on the variance. What I don't get is why the offset should make any difference. If we shift the offset by -log(1e6), the intercept term adjusts by +log(1e6), so our estimate of any given point has exactly the same mean (and exactly the same likelihood). Is the issue that, to calculate the distributional variance component we need for this pseudo-R2, we have to have one mu value for the whole distribution? Perhaps this corresponds to something like "negbin variance in the expected range" of the model?
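That invariance is easy to check directly (a sketch, reusing fit_glmm and fit_glmm3 from above, which differ only in the -log(1e6) shift of the offset):

## sketch: same likelihood, intercept shifted by log(1e6)
logLik(fit_glmm); logLik(fit_glmm3)              # should be essentially identical
glmmTMB::fixef(fit_glmm3)$cond[["(Intercept)"]] -
  glmmTMB::fixef(fit_glmm)$cond[["(Intercept)"]] # should be roughly +log(1e6)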

Thanks for the suggestions. Would you mind clarifying what you mean by "null model" here? Thinking about it, there are four models that are potentially relevant, at least the first two of which seem like possible nulls: counts ~ 1, counts ~ (1 | individual), counts ~ celltype, counts ~ celltype + (1 | individual).

I'm a little confused by "directly compare the size of the random effects SDs for individuals and observations (residuals) across the full and null models". My attempt at parsing this:

  1. calculate the SD of the individual effects (I've actually done a bit of analysis of this already)
  2. calculate the SD of the residuals
  3. do this for both full and null models
  4. calculate the ratio of the variances

There are several points where I don't find this so clear...

(Regarding your final comment, if I call performance::r2_nagelkerke(fit_glmm3), I get an error: no applicable method for 'r2_nagelkerke' applied to an object of class "glmmTMB".)

bwiernik commented 3 years ago

I'm not sure I get why a more complex model wouldn't give a better fit here.

R2 does not indicate "better fit" except in the case of linear regression. The likelihood, and statistics derived from it such as AIC, are what indicate model fit. You really should not try to interpret R2 values in this way when using GLMs (I would regard the log-normal model as a GLM in this case as well).
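For instance, a likelihood-based comparison of the two negative-binomial fits from earlier in the thread would be (a sketch, reusing fit_glmm2 and fit_glmm3):

## sketch: compare the fits on the likelihood scale rather than via pseudo-R2
AIC(fit_glmm2, fit_glmm3)
performance::compare_performance(fit_glmm2, fit_glmm3)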

So, for example, to compare the random-effect SDs across the full and null models:

mod_full <- glmmTMB::glmmTMB(counts ~ celltype + (1 | individual),
  offset = log(libsize) - log(1e6), data = data_dt, family = glmmTMB::nbinom2)
mod_null <- glmmTMB::glmmTMB(counts ~ 1 + (1 | individual),
  offset = log(libsize) - log(1e6), data = data_dt, family = glmmTMB::nbinom2)
parameters::compare_parameters(mod_full, mod_null, effects = "random")

Compare either the ratio of the individual SDs or of the observation/residual SDs across the models, depending on which is your level of interest (I assume the residual, since I assume individual-level variation is not of substantive interest?).
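A sketch of pulling out those SDs (assuming mod_full and mod_null from the block above, and glmmTMB's VarCorr() structure):

## sketch: individual-level random-intercept SD, full vs null model
library(glmmTMB)  # for the VarCorr() method on glmmTMB fits
sd_full = attr(VarCorr(mod_full)$cond$individual, "stddev")
sd_null = attr(VarCorr(mod_null)$cond$individual, "stddev")
sd_full / sd_null                  # individual-level SD, with vs without celltype
sigma(mod_full); sigma(mod_null)   # dispersion parameters, comparable analogously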