easystats / performance

:muscle: Models' quality and performance metrics (R2, ICC, LOO, AIC, BF, ...)
https://easystats.github.io/performance/
GNU General Public License v3.0

Hugely different R2 values for slightly different models #332

Open wmacnair opened 3 years ago

wmacnair commented 3 years ago

I've been fitting some mixed models to genomic (transcriptomic) count data, and wanted to assess model fit. I've tried out a few models and a few options for R2, however they give wildly different R2 values.

I have a couple of questions, which I've set out below after the MWE.

Thanks! Will

MWE

The data has 8 different celltypes, across 196 individuals (for almost all individuals we have all 8 possible values). counts corresponds to how many times we observe a given gene, libsize is the offset for this observation, and logcpm = log( counts / libsize * 1e6 + 1) i.e. log values slightly pushed away from 0 and normalized to the expected range.
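For concreteness, the logcpm column could be derived along these lines (a minimal sketch; it assumes data_dt is a data.frame with counts and libsize columns as described above):

## sketch: derive logcpm from counts and libsize (assumes columns as described)
data_dt$logcpm = log(data_dt$counts / data_dt$libsize * 1e6 + 1)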

Plotting these log values indicates that we should expect both celltype and individual to be relevant:

[Figures: logcpm by celltype; logcpm by individual]

I fit 4 models and calculate R2 for each:

## model 1: negative binomial with random effect
fit_glmm    = glmmTMB::glmmTMB(counts ~ celltype + (1|individual),
  offset = log(libsize), data = data_dt, family = glmmTMB::nbinom2)
performance::r2(fit_glmm)
# # R2 for Mixed Models
#   Conditional R2: 0.084
#      Marginal R2: 0.054
# 4: mu of 0.0 is too close to zero, estimate of random effect variances may be
#   unreliable.

## model 2: negative binomial without random effect
fit_glm     = glm(counts ~ celltype,
  offset = log(libsize), data = data_dt, family = glmmTMB::nbinom2)
performance::r2(fit_glm)
# # R2 for Generalized Linear Regression
#   Nagelkerke's R2: 0.738

## model 3: log CPM with random effect
fit_lmer    = lme4::lmer(logcpm ~ celltype + (1 | individual), data = data_dt)
performance::r2(fit_lmer)
# # R2 for Mixed Models
#   Conditional R2: 0.444
#      Marginal R2: 0.215

## model 4: log CPM without random effect
fit_lm      = lm(logcpm ~ celltype, data = data_dt)
performance::r2(fit_lm)
# # R2 for Linear Regression
#        R2: 0.214
#   adj. R2: 0.211

The main thing I don't understand is why there is such a large difference between models 1 and 2. An R2 of 8% just doesn't seem plausible when the distributions are so tight (while an R2 of 74% for model 2 seems about right). The coefficient estimates are pretty similar, so it seems more likely that the difference comes from how the R2 is calculated, rather than from the fits being wildly different.

Does the warning for model 1 indicate that there are issues here? If so, what could they be? To me it doesn't seem like an unreasonably complex model for the data...

I've half-read the Nakagawa paper, so could it be that for model 1 the R2 is based on log values, while for model 2 it is based on the deviance? Could that explain the huge difference?

(The differences between the GLM and log-linear models seem to make sense, as with many zeros present the GLM should better capture the data.)
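One way to make the comparison explicit is to request specific estimators by name rather than whatever the generic r2() picks; a sketch, reusing the fits above:

## sketch: ask for specific estimators rather than the r2() default
performance::r2_nakagawa(fit_glmm)    # latent-scale, variance-decomposition R2
performance::r2_nagelkerke(fit_glm)   # likelihood/deviance-based pseudo-R2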

wmacnair commented 3 years ago

One thought is whether it could be related to offset terms. In insight:::.variance_distributional, a warning is thrown if exp(null_fixef) is too small. This makes sense if there is no offset, but it seems like it's missing something if it's also thrown without checking any offsets. For example in my case, the intercept term is tiny, but that's because the offsets are huge.
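As a concrete check of that (a sketch, reusing fit_glmm from above; whether the null model carries the offset through is exactly the question being raised here):

## sketch: look at the null-model intercept that the warning is about
null_fit = insight::null_model(fit_glmm)
exp(glmmTMB::fixef(null_fit)$cond[["(Intercept)"]])  # tiny, because the log(libsize) offsets are large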

This made me wonder whether the R2 defined for quasi-Poisson / negative-binomial models needs to be tweaked to handle offset terms. The Nakagawa R2 is defined in eqn 3.3 here as:

R2_QP = sigma_f^2 / (sigma_f^2 + sigma_alpha^2 + ln(1 + omega / lambda))

In the standard no-offset model, var(beta_0) = 0, but with offsets it feels like this component of the variance is missing. Could we also need something like a sigma_offset^2 term in the numerator?
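To make the pieces concrete, here is a rough illustrative sketch of how eqn 3.3 assembles (not how insight computes it internally; sigma_f2, sigma_a2, omega and lambda would come from the fitted model, and the offset question above is precisely about what lambda should be):

## illustrative only: quasi-Poisson / negative-binomial Nakagawa R2, eqn 3.3
r2_qp = function(sigma_f2, sigma_a2, omega, lambda) {
  sigma_d2 = log(1 + omega / lambda)  # distribution-specific variance on the latent (log) scale
  sigma_f2 / (sigma_f2 + sigma_a2 + sigma_d2)
}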

bwiernik commented 3 years ago

A couple of thoughts:

Comparing pseudo-R2s

In a linear model, R2 has several interpretations:

  1. Proportion of variance accounted for
  2. Standardized prediction accuracy (reduction in mean squared error)
  3. Relative increase in likelihood over null model
  4. Correlation of fitted line with response

In a generalized linear model, no single statistic has all of these meanings at once. It's not always easy to compare a pseudo-R2 with the R2 from a normal model, e.g.:

  1. Not all pseudo-R2s range from 0 to 1
  2. Typical/reasonable values differ from normal linear model R2 (e.g., likelihood-based pseudo-R2s don’t tend to get that high…)

Here is a nice FAQ on various pseudo-R2 values.

Here is a comparison of various pseudo-R2 estimators for binomial/logistic models, organized by which property of the normal linear R2 each one mirrors:

| Normal R2 property | Pseudo-R2 | Equation | R function |
| --- | --- | --- | --- |
| Relative increase in likelihood over null model | Nagelkerke's R2 | R2 = [1 − (L_int / L_full)^(2/N)] / [1 − (L_int)^(2/N)] | performance::r2_nagelkerke() |
| Proportion of variance accounted for | McKelvey & Zavoina's R2 | R2 = var(ŷ) / [var(ŷ) + var(dist)] | performance::r2_mckelvey() |
| Standardized prediction accuracy | Tjur's R2 | R2 = mean(p̂ when y = 1) − mean(p̂ when y = 0) | performance::r2_tjur() |
| Correlation of fitted line with response | Somers' D | D = rank correlation of (y, p̂) | performance::r2_somers() |

The generic performance::r2() function will usually select the "best" R2 to report for a given type of model. I personally recommend Tjur's R2 and/or Somers' D for logistic models.
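As a quick usage sketch (mtcars is just a stand-in binary outcome, not related to the data in this issue):

## sketch: the four estimators on a simple logistic model
m = glm(am ~ wt + hp, data = mtcars, family = binomial)
performance::r2_nagelkerke(m)
performance::r2_mckelvey(m)
performance::r2_tjur(m)
performance::r2_somers(m)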

For mixed models, the Nakagawa R2 values are most similar to McKelvey & Zavoina's R2. Personally, I don't think these are particularly meaningful: the distribution variance has really different meanings across families, and in basically any non-Normal family it doesn't meaningfully indicate "model fit" in any real sense (in my opinion).

Model choice

I wouldn't recommend the logcpm model. It's a kludgy solution compared to modeling the counts with a more appropriate distribution (such as Poisson and friends). As you note, with the many 0s, the residuals in the logcpm model aren't close to normally distributed.

The reason you are getting the warning is that the estimated mean of the null distribution (and thus the variance of the null distribution) is zero. This is because of the huge offset you are applying. In the log-normal model, you effectively have an offset of log(libsize/1e6) = log(libsize) - log(1e6). When we apply the same offset to the negative binomial model, there is no warning and the R2 values look reasonable:

fit_glmm2 = glmmTMB::glmmTMB(counts ~ celltype + (1|individual),
  offset = log(libsize) - log(1e6), ziformula = ~ 1 + celltype,
  data = data_dt, family = glmmTMB::nbinom2)
performance::r2(fit_glmm2)
# # R2 for Mixed Models
# 
#   Conditional R2: 0.127
#      Marginal R2: 0.081

In the figure you give, the cell types aren't very different from each other, so I would expect a modest pseudo-R2 in terms of variance accounted for. The values for the normal linear models seem oddly high based on the figure.

wmacnair commented 3 years ago

Hi @bwiernik, thanks very much for the comments and resources.

Regarding model choice, something like a mixed negative binomial model is clearly the right choice for this kind of data; I just fit the logcpm models to give more interpretable comparators for the pseudo R2 methods.

I agree that a variance-based model doesn't make much sense for something counts-based, especially when the counts are small. (I'm also not a big fan of an R2 measure whose value changes so radically with a trivial reparameterization of the model...) Although for genes with larger counts, the GLMM and log-normal models should be closer, so should make more sense, right?

Conceptually, I like the relative likelihood approaches. However, it doesn't seem like these are an option for non-logistic GLMMs, at least at the moment. Is that correct? Is there no equivalent to Nagelkerke's R2 where you calculate it with both L_full and something like L_fixed_only?
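To illustrate what I mean, here is a hand-rolled sketch of that idea, where fit_fixed_only is a hypothetical refit of model 1 without the random effect (I have no idea whether this has sensible properties):

## sketch only: Nagelkerke-style ratio using the fixed-effects-only fit as the reference
ll_full  = as.numeric(logLik(fit_glmm))
ll_fixed = as.numeric(logLik(fit_fixed_only))  # hypothetical fit without (1 | individual)
n = nobs(fit_glmm)
(1 - exp(2 * (ll_fixed - ll_full) / n)) / (1 - exp(2 * ll_fixed / n))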

The broader context is that I am fitting this model to all of our genes (i.e. the design matrix remains identical, and we fit each gene's different set of counts). The reason the Nakagawa R2 is appealing is that it allows us to see, broadly, which genes are highly determined by celltype, and which vary a lot across individuals. Do you have any thoughts on what the best approach to do something like this would be? At the moment I'm thinking of using the Nakagawa results, but adding a warning that these may not be so trustworthy for genes with smaller counts.
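A sketch of what I have in mind (purely illustrative; gene_list and counts_for() are hypothetical placeholders for however the per-gene counts are pulled out):

## sketch: identical design per gene, collect Nakagawa R2s across genes
r2_by_gene = lapply(gene_list, function(g) {
  dt = data_dt
  dt$counts = counts_for(g)  # hypothetical accessor for gene g's counts
  fit = glmmTMB::glmmTMB(counts ~ celltype + (1 | individual),
    offset = log(libsize) - log(1e6), data = dt, family = glmmTMB::nbinom2)
  performance::r2_nakagawa(fit)
})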

Many thanks Will

wmacnair commented 3 years ago

Also a brief note on your zero-inflated model - you get R2 values of around 13%:

fit_glmm2 = glmmTMB::glmmTMB(counts ~ celltype + (1|individual),
  offset = log(libsize) - log(1e6), ziformula = ~ 1 + celltype, 
  data = data_dt, family = glmmTMB::nbinom2)
performance::r2(fit_glmm2)
# # R2 for Mixed Models
# 
#   Conditional R2: 0.127
#   Marginal R2: 0.081

However, if I fit a model with your suggested offset adjustment, but without the zero inflation, I get much higher R2 values:

fit_glmm3    = glmmTMB::glmmTMB(counts ~ celltype + (1|individual), 
  offset = log(libsize) - log(1e6), 
  data = data_dt, family = glmmTMB::nbinom2)
performance::r2(fit_glmm3)
# # R2 for Mixed Models
#   Conditional R2: 0.670
#      Marginal R2: 0.427

This seems odd - surely a more complex model should better explain the data?

bwiernik commented 3 years ago

I had forgotten to remove the zero-inflation bit. Thanks for checking. With respect to the values, the "more complex model" heuristic you are using just doesn't work outside of normal linear models. I am not exactly sure offhand how the Nakagawa R2 functions are written to handle mixture models with zero-inflation or dispersion parameters. In particular, the distributional variance is going to be extremely tricky with both binomial and Poisson components in the model; I don't know if we handle that well here. But I don't think those values are going to be meaningful either way. Like I said above, the distributional variance is determined by which likelihood family you choose, and it doesn't really work for the type of interpretation you're wanting to do.

These are fairly trivial reparameterizations as far as the model coefficients go, but the underlying likelihood functions used to fit them are quite different and, especially, their statements about the variance of the response variable are extremely different (a normal distribution has a free variance; a negative binomial model has a variance that is a function of the mean). Statistics that rely on those variance estimates will be very dependent on the model form.

With that in mind, I have 3 suggestions:

You can compute a likelihood-based R2 for any model by asking for it specifically with r2_nagelkerke(). The generic r2() function just provides a usually-good default choice.

wmacnair commented 3 years ago

I'm not sure I get why a more complex model wouldn't give a better fit here. The set of models without zero-inflation is a strict subset of the models with zero-inflation, no? Can the change in the model mean that the best option without zero-inflation no longer has the highest likelihood? (In any case, I wasn't intending to use a zero-inflated model.)

I get that in the negative binomial, the variance is mu + mu^2/lambda, so the value of mu has a huge impact on the variance. What I don't get is why the offset should make any difference. If we shift the offset by -log(1e6), the intercept term adjusts by +log(1e6), so our estimate of any given point has exactly the same mean (and exactly the same likelihood). Is the issue that, to calculate the distributional variance component we need for this pseudo-R2, we have to have one mu value for the whole distribution? Perhaps this corresponds to something like "negbin variance in the expected range" of the model?
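That invariance is easy to check directly (a sketch, reusing fit_glmm and fit_glmm3 from above, which differ only in the -log(1e6) shift of the offset):

## sketch: same likelihood, intercept shifted by log(1e6)
logLik(fit_glmm); logLik(fit_glmm3)              # should be essentially identical
glmmTMB::fixef(fit_glmm3)$cond[["(Intercept)"]] -
  glmmTMB::fixef(fit_glmm)$cond[["(Intercept)"]] # should be roughly +log(1e6)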

Thanks for the suggestions. Would you mind clarifying what you mean by "null model" here? Thinking about it, there are four models that are potentially relevant, at least the first two of which seem like possible nulls: counts ~ 1, counts ~ (1 | individual), counts ~ celltype, counts ~ celltype + (1 | individual).

I'm a little confused by "directly compare the size of the random effects SDs for individuals and observations (residuals) across the full and null models". My attempt at parsing this:

  1. calculate the SD of the individual effects (I've actually done a bit of analysis of this already)
  2. calculate the SD of the residuals
  3. do this for both full and null models
  4. calculate the ratio of the variances

There are several points where I don't find this so clear...

(Regarding your final comment, if I call performance::r2_nagelkerke(fit_glmm3), I get an error: no applicable method for 'r2_nagelkerke' applied to an object of class "glmmTMB".)

bwiernik commented 3 years ago

I'm not sure I get why a more complex model wouldn't give a better fit here.

R2 does not indicate "better fit" except in the case of linear regression. The likelihood, and statistics derived from it such as AIC, are what indicate model fit. You really should not try to interpret R2 values in this way when using GLMs (I would regard the log-normal model as a GLM in this case as well).
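For instance, a likelihood-based comparison of the two negative-binomial fits from earlier in the thread would be (a sketch, reusing fit_glmm2 and fit_glmm3):

## sketch: compare the fits on the likelihood scale rather than via pseudo-R2
AIC(fit_glmm2, fit_glmm3)
performance::compare_performance(fit_glmm2, fit_glmm3)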

So, for example, to compare the random-effect SDs across the full and null models:

mod_full <- glmmTMB::glmmTMB(counts ~ celltype + (1 | individual),
  offset = log(libsize) - log(1e6), data = data_dt, family = glmmTMB::nbinom2)
mod_null <- glmmTMB::glmmTMB(counts ~ 1 + (1 | individual),
  offset = log(libsize) - log(1e6), data = data_dt, family = glmmTMB::nbinom2)
parameters::compare_parameters(mod_full, mod_null, effects = "random")

Compare either the ratio of the individual SDs or of the observation/residual SDs across the models, depending on which is your level of interest (I assume the residual, since I assume individual-level variation is not of substantive interest?).
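A sketch of pulling out those SDs (assuming mod_full and mod_null from the block above, and glmmTMB's VarCorr() structure):

## sketch: individual-level random-intercept SD, full vs null model
library(glmmTMB)  # for the VarCorr() method on glmmTMB fits
sd_full = attr(VarCorr(mod_full)$cond$individual, "stddev")
sd_null = attr(VarCorr(mod_null)$cond$individual, "stddev")
sd_full / sd_null                  # individual-level SD, with vs without celltype
sigma(mod_full); sigma(mod_null)   # dispersion parameters, comparable analogously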