Closed by iago-pssjd 1 year ago
It is interesting that here (https://thestatsgeek.com/2020/12/30/mixed-model-repeated-measures-mmrm-in-stata-sas-and-r/) one can see estimates, SEs, and CIs for the random-effect variances in the Stata output, but how can one get them for gls models, and more generally for other mixed-model outputs in R?
For mixed models, these can be obtained easily (requires updating insight and parameters from GitHub):
library(lme4)
#> Loading required package: Matrix
library(parameters)
m <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)
model_parameters(m)
#> Parameter | Coefficient | SE | 95% CI | t(174) | p
#> ---------------------------------------------------------------------
#> (Intercept) | 251.41 | 6.82 | [238.03, 264.78] | 36.84 | < .001
#> Days | 10.47 | 1.55 | [ 7.44, 13.50] | 6.77 | < .001
model_parameters(m, effects = "all")
#> # Fixed Effects
#>
#> Parameter | Coefficient | SE | 95% CI | t(174) | p
#> ---------------------------------------------------------------------
#> (Intercept) | 251.41 | 6.82 | [238.03, 264.78] | 36.84 | < .001
#> Days | 10.47 | 1.55 | [ 7.44, 13.50] | 6.77 | < .001
#>
#> # Random Effects
#>
#> Parameter | Coefficient
#> -------------------------------------
#> SD (Intercept: Subject) | 24.74
#> SD (Days: Subject) | 5.92
#> Cor (Intercept~Subject) | 0.07
#> SD (Residual) | 5.06
model_parameters(m, effects = "all", group_level = TRUE)
#> # Fixed Effects
#>
#> Parameter | Coefficient | SE | 95% CI | t(174) | p
#> ---------------------------------------------------------------------
#> (Intercept) | 251.41 | 6.82 | [238.03, 264.78] | 36.84 | < .001
#> Days | 10.47 | 1.55 | [ 7.44, 13.50] | 6.77 | < .001
#>
#> # Random Effects: Subject
#>
#> Parameter | Coefficient | SE | 95% CI
#> ----------------------------------------------------------
#> (Intercept) [308] | 2.26 | 12.07 | [-21.40, 25.92]
#> (Intercept) [309] | -40.40 | 12.07 | [-64.06, -16.74]
#> (Intercept) [310] | -38.96 | 12.07 | [-62.62, -15.30]
#> (Intercept) [330] | 23.69 | 12.07 | [ 0.03, 47.35]
#> (Intercept) [331] | 22.26 | 12.07 | [ -1.40, 45.92]
#> (Intercept) [332] | 9.04 | 12.07 | [-14.62, 32.70]
#> (Intercept) [333] | 16.84 | 12.07 | [ -6.82, 40.50]
#> (Intercept) [334] | -7.23 | 12.07 | [-30.89, 16.43]
#> (Intercept) [335] | -0.33 | 12.07 | [-23.99, 23.32]
#> (Intercept) [337] | 34.89 | 12.07 | [ 11.23, 58.55]
#> (Intercept) [349] | -25.21 | 12.07 | [-48.87, -1.55]
#> (Intercept) [350] | -13.07 | 12.07 | [-36.73, 10.59]
#> (Intercept) [351] | 4.58 | 12.07 | [-19.08, 28.24]
#> (Intercept) [352] | 20.86 | 12.07 | [ -2.79, 44.52]
#> (Intercept) [369] | 3.28 | 12.07 | [-20.38, 26.93]
#> (Intercept) [370] | -25.61 | 12.07 | [-49.27, -1.95]
#> (Intercept) [371] | 0.81 | 12.07 | [-22.85, 24.47]
#> (Intercept) [372] | 12.31 | 12.07 | [-11.34, 35.97]
#> Days [308] | 9.20 | 2.30 | [ 4.68, 13.72]
#> Days [309] | -8.62 | 2.30 | [-13.14, -4.10]
#> Days [310] | -5.45 | 2.30 | [ -9.97, -0.93]
#> Days [330] | -4.81 | 2.30 | [ -9.33, -0.30]
#> Days [331] | -3.07 | 2.30 | [ -7.59, 1.45]
#> Days [332] | -0.27 | 2.30 | [ -4.79, 4.25]
#> Days [333] | -0.22 | 2.30 | [ -4.74, 4.29]
#> Days [334] | 1.07 | 2.30 | [ -3.44, 5.59]
#> Days [335] | -10.75 | 2.30 | [-15.27, -6.23]
#> Days [337] | 8.63 | 2.30 | [ 4.11, 13.15]
#> Days [349] | 1.17 | 2.30 | [ -3.34, 5.69]
#> Days [350] | 6.61 | 2.30 | [ 2.10, 11.13]
#> Days [351] | -3.02 | 2.30 | [ -7.53, 1.50]
#> Days [352] | 3.54 | 2.30 | [ -0.98, 8.05]
#> Days [369] | 0.87 | 2.30 | [ -3.65, 5.39]
#> Days [370] | 4.82 | 2.30 | [ 0.31, 9.34]
#> Days [371] | -0.99 | 2.30 | [ -5.51, 3.53]
#> Days [372] | 1.28 | 2.30 | [ -3.23, 5.80]
Created on 2021-03-08 by the reprex package (v1.0.0)
I rarely use gls(), thus model_parameters() for this model class might need some more love from us. Will look into it.
I have looked into the example from the website you posted above, but I'm not sure there are any random-effect variances. So I'd say this issue is probably not a bug in parameters. If you use lme(), you get random-effect variances:
library(nlme)
mmrm <- lme(y ~ y0 * factor(time) + trt * factor(time),
            na.action = na.omit, data = longData,
            random = ~ time | id)
parameters::model_parameters(mmrm, effects = "all")
#> # Fixed Effects
#>
#> Parameter | Coefficient | SE | 95% CI | t | df | p
#> -----------------------------------------------------------------------
#> (Intercept) | 0.07 | 0.06 | [-0.04, 0.19] | 1.26 | 806 | 0.210
#> y0 | 0.75 | 0.04 | [ 0.68, 0.82] | 20.15 | 455 | < .001
#> time2 | 0.14 | 0.07 | [ 0.01, 0.27] | 2.11 | 806 | 0.035
#> time3 | 0.09 | 0.07 | [-0.04, 0.23] | 1.38 | 806 | 0.169
#> trt | 0.43 | 0.08 | [ 0.26, 0.59] | 5.08 | 455 | < .001
#> y0 * time2 | 3.37e-03 | 0.04 | [-0.08, 0.09] | 0.08 | 806 | 0.936
#> y0 * time3 | -0.08 | 0.04 | [-0.17, 0.00] | -1.95 | 806 | 0.051
#> time2 * trt | 0.34 | 0.09 | [ 0.15, 0.52] | 3.63 | 806 | < .001
#> time3 * trt | 0.86 | 0.10 | [ 0.68, 1.05] | 9.00 | 806 | < .001
#>
#> # Random Effects
#>
#> Parameter | Coefficient
#> --------------------------------
#> SD (Intercept: id) | 0.58
#> SD (time: id) | 3.28e-03
#> Cor (Intercept~id) | -0.03
#> SD (Residual) | 0.83
Created on 2021-03-27 by the reprex package (v1.0.0)
Sure, it is not a bug (I asked for a feature request), and maybe it is just not explicit in the gls output. Maybe I misunderstood the Stata output of the example, but what I meant by "random-effect variances" is this part of the unstructured covariance matrix of the residuals:
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: (empty) |
-----------------------------+------------------------------------------------
Residual: Unstructured |
var(e1) | .80163 .0531476 .7039468 .9128682
var(e2) | .8411992 .0583165 .7343261 .9636265
var(e3) | .7903076 .0573347 .6855573 .9110633
cov(e1,e2) | .3482899 .0437791 .2624845 .4340953
cov(e1,e3) | .3728598 .044828 .2849986 .4607209
cov(e2,e3) | .3081405 .0443681 .2211807 .3951003
Maybe the estimates are those in mmrm$apVar?
The model you are reporting from Stata is not a GLS model, but a mixed effects model. The labelling of the residual variances as Random Effects by Stata is not really accurate. The dispersion and random effects parts of a model are distinct components.
A GLS model does not have random-effect variances. That page you linked to is fitting several different types of models across the various programs. The appropriate model to compare with on that page would have been an nlme::lme() model or a glmmTMB::glmmTMB() model.
What is currently not accessible from any easystats functions as far as I can see is the residual correlation structure. Below, for example, is a simple model with AR(1) correlation structure.
library(nlme)
library(easystats)
fm1 <- gls(follicles ~ sin(2*pi*Time) + cos(2*pi*Time), Ovary,
correlation = corAR1(form = ~ 1 | Mare))
summary(fm1)
str(fm1, max.level = 1)
model_parameters(fm1)
model_performance(fm1)
get_variance(fm1)
The residual correlation structure can be obtained from fm1$modelStruct, specifically from fm1$modelStruct$corStruct. That can be probed for both the type of structure and its parameter estimates. This would be relevant for gls, nlme, and glmmTMB models at least. I am not sure which existing function is best to retrieve this, or if a new function is needed.
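For instance, both pieces of information can be probed directly from the fitted object. This is only a sketch, reusing the AR(1) model from the code above; `coef()` with `unconstrained = FALSE` returns the correlation parameter on its natural scale:

```r
library(nlme)

# Refit the AR(1) example from above
fm1 <- gls(follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time), Ovary,
           correlation = corAR1(form = ~ 1 | Mare))

cs <- fm1$modelStruct$corStruct
class(cs)                        # reveals the structure type (corAR1)
coef(cs, unconstrained = FALSE)  # the natural-scale correlation parameter Phi
```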
The model you are reporting from Stata is not a GLS model, but a mixed effects model.
The gls model is what adapts the Stata model to R. I quote the linked page: "Linear mixed models are often fitted in R using the lme4 package, with the lmer function. This function however does not allow us to specify a residual covariance matrix which allows for dependency. We thus instead use the gls in the older nlme package."
The labelling of the Residuals variances as Random Effects by Stata is not really accurate. The dispersion and random effects parts of a model are distinct components. A GLS model does not have random effects variances.
In fact, this was my error when introducing this issue above. As shown some comments above, the Stata labelling seems appropriate. Indeed, note that Random-Effects Parameters labels a row which is empty, since the model includes neither random intercepts nor random coefficients. Meanwhile, what appear to be the residual variances are labelled Residual: Unstructured:
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: (empty) |
-----------------------------+------------------------------------------------
Residual: Unstructured |
var(e1) | .80163 .0531476 .7039468 .9128682
var(e2) | .8411992 .0583165 .7343261 .9636265
var(e3) | .7903076 .0573347 .6855573 .9110633
cov(e1,e2) | .3482899 .0437791 .2624845 .4340953
cov(e1,e3) | .3728598 .044828 .2849986 .4607209
cov(e2,e3) | .3081405 .0443681 .2211807 .3951003
That page you linked to is fitting several different types of models across the various programs. The appropriate model to compare with on that page would have been an nlme::lme() model or a glmmTMB::glmmTMB() model.
I believe the quote of the linked page above answers this. Further, note that the estimates (and even the standard errors) of the Stata and R outputs are essentially equal.
The residual correlation structure can be obtained from fm1$modelStruct, specifically from fm1$modelStruct$corStruct. That can be probed for both the type of structure and its parameter estimates.
Now I cannot test, but I believe I did before, and the output of fm1$modelStruct$corStruct for the model from the link was the part
Correlation Structure: General
Formula: ~time | id
Parameter estimate(s):
Correlation:
1 2
2 0.424
3 0.468 0.378
which does not bear much similarity to the Residual: Unstructured output from Stata. It was by looking at the values of the gls output that I found fm1$apVar to be the data most similar (at least in structure) to the Stata residuals.
Those aren't the same models. The Stata model you are showing has an unstructured covariance matrix. The gls model has an AR(1) structure. The corStruct is the analogous parameter--one correlation value between subsequent measurements.
apVar is not the same thing as the Stata variance components. That is the approximate covariance matrix of the Phi and log(sigma) parameters.
The gls model in the example uses the option correlation = nlme::corSymm(form = ~ time | id), and it seems (https://stats.idre.ucla.edu/r/seminars/repeated-measures-analysis-with-r/) that the option "corr = corSymm" specifies that the correlation structure is unstructured.
Then, I ask again: if corStruct is the analogous parameter, why does the Stata output have 6 estimates (the 3 variances and the 3 covariances) while the corStruct output has 3? What is the analogy?
A bit off topic, but I realized that I stopped asking "what do I need to do to get the same output in R as I get from other software packages?", and instead ask "what are the other software packages doing, such that their output doesn't resemble R's results?". For me, R is the reference software in the meantime. Of course, there are different default options for each software, but I don't expect R to do something wrong while the other software does it right.
I fully agree. And it is usual that other software produces outputs that R does not, and that there are good reasons for that. I asked for those values because I believe they have their own interest.
Then, maybe I should read
Correlation Structure: General
Formula: ~time | id
Parameter estimate(s):
Correlation:
1 2
2 0.424
3 0.468 0.378
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | time
Parameter estimates:
1 2 3
1.000000 1.024376 0.992909
as the variances and correlations (instead of covariances) among the distinct residual strata, possibly computed in a different way? I did not run other tests, but in that case it seems like gls is assuming that the first residual stratum has variance equal to 1?
No, you are misinterpreting that. The values there are ratios of each stratum's residual standard deviation relative to the reference stratum. To obtain the standard deviation for each stratum, you would multiply these values by sigma; squaring that product gives the variance.
This is described in the nlme documentation for the various variance and correlation structure functions. https://rdrr.io/cran/nlme/man/varIdent.html
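A quick numeric sketch of that relationship, using the Orthodont data that ships with nlme as a stand-in (per-Sex heteroskedasticity here instead of the per-time strata above; `allCoef = TRUE` includes the reference stratum, whose ratio is fixed at 1):

```r
library(nlme)

# varIdent: different residual SDs per stratum (here: Sex)
fm <- gls(distance ~ age, data = Orthodont,
          weights = varIdent(form = ~ 1 | Sex))

ratios <- coef(fm$modelStruct$varStruct,
               unconstrained = FALSE, allCoef = TRUE)
stratum_sd  <- fm$sigma * ratios  # per-stratum residual SD
stratum_var <- stratum_sd^2       # per-stratum residual variance
```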
If you're interested in using gls(), I suggest you start by working your way through the nlme documentation, rather than jumping in to fit models and try to discern what the various components reported are. nlme has some odd choices, particularly when it comes to variance structure specification.
Personally, I would strongly suggest using glmmTMB instead--it can handle all manner of residual variance structures, zero-inflation, random effects, and generalized linear models.
The values there are ratios of each stratum's residual standard deviation relative to the reference stratum. To obtain the standard deviation for each stratum, you would multiply these values by sigma; squaring that product gives the variance.
Thank you! Indeed, I get now
((Residual standard error) × (SD ratio for stratum i))^2 [R gls] = var(e_i) [Stata mixed, Residual: Unstructured]
((Residual standard error) × (SD ratio for stratum i)) × ((Residual standard error) × (SD ratio for stratum j)) × corStruct[i, j] [R gls] = cov(e_i, e_j) [Stata mixed]
If you're interested in using gls(), I suggest you start by working your way through the nlme documentation, rather than jumping in to fit models and try to discern what the various components reported are. nlme has some odd choices, particularly when it comes to variance structure specification.
In fact my interest was in adapting the Stata code mixed ... nocons residuals(unstructured,...)... to R, and that is how I arrived at that web page (it is because I found this example that I considered nlme::gls instead of glmmTMB, a package which I use much more regularly). Then, looking at it, I found the discussed Stata output interesting, so I commented in this issue. (And I realize that, as you suggest, I should read the nlme documentation carefully, or even the Pinheiro and Bates book.)
Now the question that arises is: how can the other values associated with those variances and covariances (their SEs and CIs) be computed?
Yeah, development on nlme is basically done and its API is really different from more modern R packages. glmmTMB can do basically everything nlme can do at this point, so I wouldn't invest much time in learning nlme given the choice.
Edit: The intervals() function in nlme can compute confidence intervals for the fixed-effect, variance, and correlation parameters in GLS models. The SE of a variance isn't really a very useful statistic, because the variance is not symmetrically distributed; the SE of the log variance is more useful, and this is what the function uses to construct intervals on the variance parameters.
it can handle all manner of residual variance structures
Is there a good, easy-to-understand paper/tutorial/whatever that explains when to use or specify which variance structure? I almost always use the default in lme4/glmmTMB.
I know that vignette; however, it doesn't help me decide when to use a different covstruct than the default. What would be the nature of my data where I have to take this into account? I guess AR1 is for time-series-like data.
Yes. For example: AR for time series, spatial structures for geographic data, Toeplitz for some genetic applications, and symmetric or heteroskedastic symmetric structures for a lot of meta-analysis applications.
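As a concrete sketch of the time-series case in glmmTMB, using lme4's sleepstudy data: the time variable must be converted to a factor for the covariance-structure term, and `+ 0` drops the intercept from that term (this is just an illustration of the syntax, not an endorsement of AR(1) for these data):

```r
library(glmmTMB)

data("sleepstudy", package = "lme4")
sleepstudy$fDays <- factor(sleepstudy$Days)

# AR(1) residual correlation across days within subject
m_ar1 <- glmmTMB(Reaction ~ Days + ar1(fDays + 0 | Subject),
                 data = sleepstudy)
VarCorr(m_ar1)  # shows the estimated SD and the AR(1) correlation
```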
Let me, please, make a remark on the occasion of this discussion, regarding glmmTMB vs. nlme:
Until recently, nlme::gls() was the only way to reproduce the marginal MMRM via the SAS PROC MIXED REPEATED approach, which is the core way of analyzing longitudinal studies in clinical trials (instead of the SAS... RANDOM one), especially in connection with the Satterthwaite DoF (the industry standard) and an appropriate robust estimator of the var/cov, the recommended CR2 or CR3 (available in the clubSandwich package).
Unfortunately, for the nlme::gls()
As a consequence, it happened that the results of longitudinal analyses (ANCOVA, LDA, cLDA) done with nlme::gls() were questioned by statistical reviewers, who noticed discrepancies when trying to reproduce the analysis in SAS. I have faced such feedback a few times too. Whether we like it or not, the clinical-trials industry all around has been using SAS since ~1980 and it constitutes the absolute standard there. One can argue about the superiority of R's implementations over SAS (/SPSS/Stata/NCSS/WinNonlin/nQuery), but this will not help make progress in the submission of research results.
It's true that R is catching up, but it is still missing important domain-specific tools, including a reliable procedure for the discussed MMRM or, for instance, GEE, which is used for longitudinal analysis next to GLS (with, as usual in R, 6 or 7 packages for doing it, with serious issues, poor support, and scattered functionality).
This problem was so important that a dedicated package for this kind of analysis has recently been created within the OpenPharma initiative: https://openpharma.github.io/mmrm/main/
It's going to be a true game changer, offering SAS-compliant Satterthwaite and (soon) Kenward-Roger DoF with the support for emmeans and (soon) empirical estimator of the var/cov.
I'm commenting on the occasion of this discussion because I believe it might be beneficial to enable support for a package so awaited and important for the pharmaceutical industry :)
This problem was so important that a dedicated package for this kind of analysis has recently been created within the OpenPharma initiative: https://openpharma.github.io/mmrm/main/
Thanks, @hugesingleton, for bringing this to our attention.
We will try to support these models in the easystats ecosystem (see https://github.com/easystats/parameters/issues/821).
Applying random_parameters to gls models produces errors. Also dof_kenward. Thanks!