Closed by iago-pssjd 1 year ago
It is interesting that here (https://thestatsgeek.com/2020/12/30/mixed-model-repeated-measures-mmrm-in-stata-sas-and-r/) one can see estimates, SEs, and CIs for the random-effect variances in the Stata output, but how can one get them for gls models, and more generally for other mixed-model outputs in R?
For mixed models, these can be obtained easily (requires updating insight and parameters from GitHub):
library(lme4)
#> Loading required package: Matrix
library(parameters)
m <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)
model_parameters(m)
#> Parameter | Coefficient | SE | 95% CI | t(174) | p
#> ---------------------------------------------------------------------
#> (Intercept) | 251.41 | 6.82 | [238.03, 264.78] | 36.84 | < .001
#> Days | 10.47 | 1.55 | [ 7.44, 13.50] | 6.77 | < .001
model_parameters(m, effects = "all")
#> # Fixed Effects
#>
#> Parameter | Coefficient | SE | 95% CI | t(174) | p
#> ---------------------------------------------------------------------
#> (Intercept) | 251.41 | 6.82 | [238.03, 264.78] | 36.84 | < .001
#> Days | 10.47 | 1.55 | [ 7.44, 13.50] | 6.77 | < .001
#>
#> # Random Effects
#>
#> Parameter | Coefficient
#> -------------------------------------
#> SD (Intercept: Subject) | 24.74
#> SD (Days: Subject) | 5.92
#> Cor (Intercept~Subject) | 0.07
#> SD (Residual) | 5.06
model_parameters(m, effects = "all", group_level = TRUE)
#> # Fixed Effects
#>
#> Parameter | Coefficient | SE | 95% CI | t(174) | p
#> ---------------------------------------------------------------------
#> (Intercept) | 251.41 | 6.82 | [238.03, 264.78] | 36.84 | < .001
#> Days | 10.47 | 1.55 | [ 7.44, 13.50] | 6.77 | < .001
#>
#> # Random Effects: Subject
#>
#> Parameter | Coefficient | SE | 95% CI
#> ----------------------------------------------------------
#> (Intercept) [308] | 2.26 | 12.07 | [-21.40, 25.92]
#> (Intercept) [309] | -40.40 | 12.07 | [-64.06, -16.74]
#> (Intercept) [310] | -38.96 | 12.07 | [-62.62, -15.30]
#> (Intercept) [330] | 23.69 | 12.07 | [ 0.03, 47.35]
#> (Intercept) [331] | 22.26 | 12.07 | [ -1.40, 45.92]
#> (Intercept) [332] | 9.04 | 12.07 | [-14.62, 32.70]
#> (Intercept) [333] | 16.84 | 12.07 | [ -6.82, 40.50]
#> (Intercept) [334] | -7.23 | 12.07 | [-30.89, 16.43]
#> (Intercept) [335] | -0.33 | 12.07 | [-23.99, 23.32]
#> (Intercept) [337] | 34.89 | 12.07 | [ 11.23, 58.55]
#> (Intercept) [349] | -25.21 | 12.07 | [-48.87, -1.55]
#> (Intercept) [350] | -13.07 | 12.07 | [-36.73, 10.59]
#> (Intercept) [351] | 4.58 | 12.07 | [-19.08, 28.24]
#> (Intercept) [352] | 20.86 | 12.07 | [ -2.79, 44.52]
#> (Intercept) [369] | 3.28 | 12.07 | [-20.38, 26.93]
#> (Intercept) [370] | -25.61 | 12.07 | [-49.27, -1.95]
#> (Intercept) [371] | 0.81 | 12.07 | [-22.85, 24.47]
#> (Intercept) [372] | 12.31 | 12.07 | [-11.34, 35.97]
#> Days [308] | 9.20 | 2.30 | [ 4.68, 13.72]
#> Days [309] | -8.62 | 2.30 | [-13.14, -4.10]
#> Days [310] | -5.45 | 2.30 | [ -9.97, -0.93]
#> Days [330] | -4.81 | 2.30 | [ -9.33, -0.30]
#> Days [331] | -3.07 | 2.30 | [ -7.59, 1.45]
#> Days [332] | -0.27 | 2.30 | [ -4.79, 4.25]
#> Days [333] | -0.22 | 2.30 | [ -4.74, 4.29]
#> Days [334] | 1.07 | 2.30 | [ -3.44, 5.59]
#> Days [335] | -10.75 | 2.30 | [-15.27, -6.23]
#> Days [337] | 8.63 | 2.30 | [ 4.11, 13.15]
#> Days [349] | 1.17 | 2.30 | [ -3.34, 5.69]
#> Days [350] | 6.61 | 2.30 | [ 2.10, 11.13]
#> Days [351] | -3.02 | 2.30 | [ -7.53, 1.50]
#> Days [352] | 3.54 | 2.30 | [ -0.98, 8.05]
#> Days [369] | 0.87 | 2.30 | [ -3.65, 5.39]
#> Days [370] | 4.82 | 2.30 | [ 0.31, 9.34]
#> Days [371] | -0.99 | 2.30 | [ -5.51, 3.53]
#> Days [372] | 1.28 | 2.30 | [ -3.23, 5.80]
Created on 2021-03-08 by the reprex package (v1.0.0)
I rarely use gls(), thus model_parameters() for this model class might need some more love from us. Will look into it.
I have looked into the example from the website you posted above, but I'm not sure there are any random-effect variances. So I'd say this issue is probably not a bug in parameters. If you use lme(), you get random-effect variances:
library(nlme)
mmrm <- lme(y ~ y0 * factor(time) + trt * factor(time),
            na.action = na.omit, data = longData,
            random = ~ time | id)
parameters::model_parameters(mmrm, effects = "all")
#> # Fixed Effects
#>
#> Parameter | Coefficient | SE | 95% CI | t | df | p
#> -----------------------------------------------------------------------
#> (Intercept) | 0.07 | 0.06 | [-0.04, 0.19] | 1.26 | 806 | 0.210
#> y0 | 0.75 | 0.04 | [ 0.68, 0.82] | 20.15 | 455 | < .001
#> time2 | 0.14 | 0.07 | [ 0.01, 0.27] | 2.11 | 806 | 0.035
#> time3 | 0.09 | 0.07 | [-0.04, 0.23] | 1.38 | 806 | 0.169
#> trt | 0.43 | 0.08 | [ 0.26, 0.59] | 5.08 | 455 | < .001
#> y0 * time2 | 3.37e-03 | 0.04 | [-0.08, 0.09] | 0.08 | 806 | 0.936
#> y0 * time3 | -0.08 | 0.04 | [-0.17, 0.00] | -1.95 | 806 | 0.051
#> time2 * trt | 0.34 | 0.09 | [ 0.15, 0.52] | 3.63 | 806 | < .001
#> time3 * trt | 0.86 | 0.10 | [ 0.68, 1.05] | 9.00 | 806 | < .001
#>
#> # Random Effects
#>
#> Parameter | Coefficient
#> --------------------------------
#> SD (Intercept: id) | 0.58
#> SD (time: id) | 3.28e-03
#> Cor (Intercept~id) | -0.03
#> SD (Residual) | 0.83
Created on 2021-03-27 by the reprex package (v1.0.0)
Sure, it is not a bug (I asked for a feature request), and maybe it is just not explicit in the gls output. Maybe I misunderstood the Stata output of the example, but what I meant by "random-effect variances" is this part of the unstructured covariance matrix of the residuals:
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: (empty) |
-----------------------------+------------------------------------------------
Residual: Unstructured |
var(e1) | .80163 .0531476 .7039468 .9128682
var(e2) | .8411992 .0583165 .7343261 .9636265
var(e3) | .7903076 .0573347 .6855573 .9110633
cov(e1,e2) | .3482899 .0437791 .2624845 .4340953
cov(e1,e3) | .3728598 .044828 .2849986 .4607209
cov(e2,e3) | .3081405 .0443681 .2211807 .3951003
Maybe the estimates are those in mmrm$apVar?
The model you are reporting from Stata is not a GLS model, but a mixed effects model. The labelling of the residual variances as Random Effects by Stata is not really accurate. The dispersion and random effects parts of a model are distinct components.
A GLS model does not have random-effect variances. That page you linked to is fitting several different types of models across the various programs. The appropriate model to compare with on that page would have been an nlme::lme() model or a glmmTMB::glmmTMB() model.
What is currently not accessible from any easystats functions as far as I can see is the residual correlation structure. Below, for example, is a simple model with AR(1) correlation structure.
library(nlme)
library(easystats)
fm1 <- gls(follicles ~ sin(2*pi*Time) + cos(2*pi*Time), Ovary,
correlation = corAR1(form = ~ 1 | Mare))
summary(fm1)
str(fm1, max.level = 1)
model_parameters(fm1)
model_performance(fm1)
get_variance(fm1)
The residual correlation structure can be obtained from fm1$modelStruct, specifically from fm1$modelStruct$corStruct. That can be probed for both the type of structure and its parameter estimates. This would be relevant for gls, nlme, and glmmTMB models at least. I am not sure which existing function is best to retrieve this, or if a new function is needed.
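For instance, both pieces of information can be probed directly from the fitted object. This is only a sketch, reusing the AR(1) model from the code above; `coef()` with `unconstrained = FALSE` returns the correlation parameter on its natural scale:

```r
library(nlme)

# Refit the AR(1) example from above
fm1 <- gls(follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time), Ovary,
           correlation = corAR1(form = ~ 1 | Mare))

cs <- fm1$modelStruct$corStruct
class(cs)                        # reveals the structure type (corAR1)
coef(cs, unconstrained = FALSE)  # the natural-scale correlation parameter Phi
```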
The model you are reporting from Stata is not a GLS model, but a mixed effects model.
The gls model is what adapts the Stata model to R. I quote the linked page: "Linear mixed models are often fitted in R using the lme4 package, with the lmer function. This function however does not allow us to specify a residual covariance matrix which allows for dependency. We thus instead use the gls in the older nlme package."
The labelling of the Residuals variances as Random Effects by Stata is not really accurate. The dispersion and random effects parts of a model are distinct components. A GLS model does not have random effects variances.
In fact, this was my error when introducing this issue above. As shown some comments above, the Stata labelling seems appropriate. Indeed, note that Random-Effects Parameters labels a row which is empty, since the model includes neither random intercepts nor random coefficients. Meanwhile, what appear to be the residual variances are labelled Residual: Unstructured:
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: (empty) |
-----------------------------+------------------------------------------------
Residual: Unstructured |
var(e1) | .80163 .0531476 .7039468 .9128682
var(e2) | .8411992 .0583165 .7343261 .9636265
var(e3) | .7903076 .0573347 .6855573 .9110633
cov(e1,e2) | .3482899 .0437791 .2624845 .4340953
cov(e1,e3) | .3728598 .044828 .2849986 .4607209
cov(e2,e3) | .3081405 .0443681 .2211807 .3951003
That page you linked to is fitting several different types of models across the various programs. The appropriate model to compare with on that page would have been an nlme::lme() model or a glmmTMB::glmmTMB() model.
I believe the quote of the linked page above answers this. Further, note that the estimates (and even the standard errors) of the Stata and R outputs are essentially equal.
The residual correlation structure can be obtained from fm1$modelStruct, specifically from fm1$modelStruct$corStruct. That can be probed for both the type of structure and its parameter estimates.
Now I cannot test, but I believe I did before, and the output of fm1$modelStruct$corStruct for the model from the link was the part
Correlation Structure: General
Formula: ~time | id
Parameter estimate(s):
Correlation:
1 2
2 0.424
3 0.468 0.378
which does not bear much similarity to the Residual: Unstructured output from Stata. It was by looking at the values of the gls output that I found fm1$apVar to be the data most similar (at least in structure) to the Stata residuals.
Those aren't the same models. The Stata model you are showing has an unstructured covariance matrix. The gls model has an AR(1) structure. The corStruct is the analogous parameter--one correlation value between subsequent measurements.
apVar is not the same thing as the Stata variance components. That is the approximate covariance matrix of the Phi and log(sigma) parameters.
The gls model in the example uses the option correlation = nlme::corSymm(form = ~ time | id), and it seems (https://stats.idre.ucla.edu/r/seminars/repeated-measures-analysis-with-r/) that the option "corr = corSymm" specifies that the correlation structure is unstructured.
Then, I ask again: if corStruct is the analogous parameter, why does the Stata output have 6 estimates (the 3 variances and the 3 covariances) while the corStruct output has 3? What is the analogy?
A bit off topic, but I realized that I stopped asking "what do I need to do to get the same output in R as I get from other software packages?", and instead ask "what are the other software packages doing, such that their output doesn't resemble R's results?". For me, R is the reference software in the meantime. Of course, there are different default options for each software, but I don't expect R to do something wrong while the other software does it right.
I fully agree. And it is usual that other software produces outputs that R does not, and that there are good reasons for that. I asked for those values because I believe they have their own interest.
Then, maybe I should read
Correlation Structure: General
Formula: ~time | id
Parameter estimate(s):
Correlation:
1 2
2 0.424
3 0.468 0.378
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | time
Parameter estimates:
1 2 3
1.000000 1.024376 0.992909
as the variances and correlations (instead of covariances) among the distinct residual strata, possibly computed in a different way? I did not run other tests, but in that case it seems like gls is assuming that the first residual stratum has variance equal to 1?
No, you are misinterpreting that. The values there are ratios of each stratum's residual standard deviation relative to the reference stratum. To obtain the standard deviation for each stratum, you would multiply these values by sigma; squaring that product gives the variance.
This is described in the nlme documentation for the various variance and correlation structure functions. https://rdrr.io/cran/nlme/man/varIdent.html
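A quick numeric sketch of that relationship, using the Orthodont data that ships with nlme as a stand-in (per-Sex heteroskedasticity here instead of the per-time strata above; `allCoef = TRUE` includes the reference stratum, whose ratio is fixed at 1):

```r
library(nlme)

# varIdent: different residual SDs per stratum (here: Sex)
fm <- gls(distance ~ age, data = Orthodont,
          weights = varIdent(form = ~ 1 | Sex))

ratios <- coef(fm$modelStruct$varStruct,
               unconstrained = FALSE, allCoef = TRUE)
stratum_sd  <- fm$sigma * ratios  # per-stratum residual SD
stratum_var <- stratum_sd^2       # per-stratum residual variance
```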
If you're interested in using gls(), I suggest you start by working your way through the nlme documentation, rather than jumping in to fit models and try to discern what the various components reported are. nlme has some odd choices, particularly when it comes to variance structure specification.
Personally, I would strongly suggest using glmmTMB instead--it can handle all manner of residual variance structures, zero-inflation, random effects, and generalized linear models.
The values there are ratios of each stratum's residual standard deviation relative to the reference stratum. To obtain the standard deviation for each stratum, you would multiply these values by sigma; squaring that product gives the variance.
Thank you! Indeed, I get now
((Residual standard error) × (SD ratio for stratum i))^2 [R gls] = var(e_i) [Stata mixed, Residual: Unstructured]
((Residual standard error) × (SD ratio for stratum i)) × ((Residual standard error) × (SD ratio for stratum j)) × corStruct[i, j] [R gls] = cov(e_i, e_j) [Stata mixed]
If you're interested in using gls(), I suggest you start by working your way through the nlme documentation, rather than jumping in to fit models and try to discern what the various components reported are. nlme has some odd choices, particularly when it comes to variance structure specification.
In fact my interest was in adapting the Stata code mixed ... nocons residuals(unstructured,...)... to R, and that is how I arrived at that web page (it is because I found this example that I considered nlme::gls instead of glmmTMB, a package which I use much more regularly). Then, looking at it, I found the discussed Stata output interesting, so I commented in this issue. (And I realize that, as you suggest, I should read the nlme documentation carefully, or even the Pinheiro and Bates book.)
Now the question that arises is: how can the other values associated with those variances and covariances (their SEs and CIs) be computed?
Yeah, development on nlme is basically done and its API is really different from more modern R packages. glmmTMB can do basically everything nlme can do at this point, so I wouldn't invest much time in learning nlme given the choice.
Edit: The intervals() function in nlme can compute confidence intervals for the fixed-effect, variance, and correlation parameters in GLS models. The SE of a variance isn't really a very useful statistic, because the variance is not symmetrically distributed; the SE of the log variance is more useful, and this is what the function uses to construct intervals on the variance parameters.
it can handle all manner of residual variance structures
Is there a good, easy-to-understand paper/tutorial/whatever that explains when to use or specify which variance structure? I almost always use the default in lme4/glmmTMB.
I know that vignette; however, it doesn't help me decide when to use a different covstruct than the default. What would be the nature of my data where I have to take this into account? I guess AR1 is for time-series-like data.
Yes. For example: AR for time series, spatial structures for geographic data, Toeplitz for some genetic applications, and symmetric or heteroskedastic symmetric structures for a lot of meta-analysis applications.
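As a concrete sketch of the time-series case in glmmTMB, using lme4's sleepstudy data: the time variable must be converted to a factor for the covariance-structure term, and `+ 0` drops the intercept from that term (this is just an illustration of the syntax, not an endorsement of AR(1) for these data):

```r
library(glmmTMB)

data("sleepstudy", package = "lme4")
sleepstudy$fDays <- factor(sleepstudy$Days)

# AR(1) residual correlation across days within subject
m_ar1 <- glmmTMB(Reaction ~ Days + ar1(fDays + 0 | Subject),
                 data = sleepstudy)
VarCorr(m_ar1)  # shows the estimated SD and the AR(1) correlation
```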
Let me, please, make a remark on the occasion of this discussion, regarding glmmTMB vs. nlme:
Until recently, nlme::gls() was the only way to reproduce the marginal MMRM via the SAS PROC MIXED REPEATED approach, which is the core way of analyzing longitudinal studies in clinical trials (instead of the SAS... RANDOM one), especially in connection with the Satterthwaite DoF (the industry standard) and an appropriate robust estimator of the var/cov, the recommended CR2 or CR3 (available in the clubSandwich package).
Unfortunately, for the nlme::gls()
As a consequence, it happened that the results of longitudinal analyses (ANCOVA, LDA, cLDA) done with nlme::gls() were questioned by statistical reviewers, who noticed discrepancies when trying to reproduce the analysis in SAS. I have faced such feedback a few times too. Whether we like it or not, the clinical-trials industry all around has been using SAS since ~1980 and it constitutes the absolute standard there. One can argue about the superiority of R's implementations over SAS (/SPSS/Stata/NCSS/WinNonlin/nQuery), but this will not help make progress in the submission of research results.
It's true that R is catching up, but it is still missing important domain-specific tools, including a reliable procedure for the discussed MMRM or, for instance, GEE, which is used for longitudinal analysis next to GLS (with, as usual in R, 6 or 7 packages for doing it, with serious issues, poor support, and scattered functionality).
This problem was so important that a dedicated package for this kind of analysis has recently been created within the OpenPharma initiative: https://openpharma.github.io/mmrm/main/
It's going to be a true game changer, offering SAS-compliant Satterthwaite and (soon) Kenward-Roger DoF with the support for emmeans and (soon) empirical estimator of the var/cov.
I'm commenting on the occasion of this discussion because I believe it might be beneficial to enable support for a package so awaited and important for the pharmaceutical industry :)
This problem was so important that a dedicated package for this kind of analysis has recently been created within the OpenPharma initiative: https://openpharma.github.io/mmrm/main/
Thanks, @hugesingleton, for bringing this to our attention.
We will try to support these models in the easystats ecosystem (see https://github.com/easystats/parameters/issues/821).
Applying random_parameters to gls models produces errors. Also dof_kenward. Thanks!