FR/Bug: Calculate Sigma (`get_sigma`) appropiately for `vglm` (and `vgam`) models

iago-pssjd commented 2 years ago

First, it is a bug, since, as stated in ?get_sigma

By default, get_sigma() tries to extract sigma by calling stats::sigma(). If the model-object has no sigma() method, the next step is calculating sigma as square-root of the model-deviance divided by the residual degrees of freedom. Finally, if even this approach fails, and x is a mixed model, the residual standard deviation is accessed using the square-root from get_variance_residual().

Therefore, when the model has no sigma() method get_sigma tries a calculation that not always is suitable, at least for many of vglm models. For example (from help("tobit", package = "VGAM")):

tdata <- data.frame(x2 = seq(-1, 1, length = (nn <- 100)))
set.seed(1)
Lower <- 1; Upper <- 4  # For the nonstandard Tobit model
tdata <- transform(tdata,
                   Lower.vec = rnorm(nn, Lower, 0.5),
                   Upper.vec = rnorm(nn, Upper, 0.5))
meanfun1 <- function(x) 0 + 2*x
meanfun2 <- function(x) 2 + 2*x
meanfun3 <- function(x) 2 + 2*x
meanfun4 <- function(x) 3 + 2*x
tdata <- transform(tdata,
  y1 = rtobit(nn, mean = meanfun1(x2)),  # Standard Tobit model
  y2 = rtobit(nn, mean = meanfun2(x2), Lower = Lower, Upper = Upper),
  y3 = rtobit(nn, mean = meanfun3(x2), Lower = Lower.vec, Upper = Upper.vec),
  y4 = rtobit(nn, mean = meanfun3(x2), Lower = Lower.vec, Upper = Upper.vec))
fit1 <- vglm(y1 ~ x2, tobit, data = tdata)
get_sigma(fit1)
[1] 2.694853
attr(,"class")
[1] "insight_aux" "numeric"

while the right sigma value is

exp(coef(fit1)["(Intercept):2"]) # look also at coef(fit1, matrix = TRUE)
(Intercept):2 
    0.9395564

(see https://stats.oarc.ucla.edu/r/dae/tobit-models/ and the book "Vector Generalized Linear and Additive Models" by Thomas Yee). In fact, in this package there are 2 points to consider:

on one hand (section 1.2.1 of the book), sigma is not treated as a scale parameter, like in GLM's, but it is generally (not always) modelled (as also happens with gamlss package), therefore it is found as an intercept;
on the other hand (sections 1.4 and 3.5.1 of the book), it allows for multiple and multivariate responses. In this case, should be sigma maybe a list, with each element being a matrix corresponding to the multivariate responses (using the intercepts (Intercept):2, (Intercept):4, etc.)? (I refer to the distinction between multiple and multivariate responses made in section 3.5.1 of the book, in the second half of page 115). This second point would be related to this issue I recently added to performance: https://github.com/easystats/performance/issues/430

Thanks!

strengejacke commented 2 years ago

Any example for multivariate models?

iago-pssjd commented 2 years ago

Hi @strengejacke.

I assume you ask for the multivariate response models. I cite a paragraph in the same page of the book I mentioned above

Some care is needed to distinguish between multiple responses and multivariate responses. Multiple responses are treated independently of each other, and are to be inputted side-by-side on the LHS of the formula using cbind().Eachresponse may be a vector, of dimension Q1. Multivariate responses correspond to Q1 > 1, and are handled by some full-likelihood model that takes into account their joint distribution.

The example given by the author is the family binormal, so

set.seed(123); nn <- 1000
bdata <- data.frame(x2 = runif(nn), x3 = runif(nn))
bdata <- transform(bdata, y1 = rnorm(nn, 1 + 2 * x2),
                          y2 = rnorm(nn, 3 + 4 * x2))
fit1 <- vglm(cbind(y1, y2) ~ x2,
             binormal(), data = bdata, trace = FALSE)
coef(fit1, matrix = TRUE)
               mean1    mean2 loglink(sd1) loglink(sd2) rhobitlink(rho)
(Intercept) 1.021178 2.929393  0.009053151  -0.02282574      0.05231844
x2          2.042807 4.101541  0.000000000   0.00000000      0.00000000

In this case sigma would be the exp of the diagonal matrix 0.009053151, -0.02282574. (Further, we would have an intercept-vector 1.021178, 2.929393 and an x2-slope vector 2.042807, 4.101541). Other examples would be with the families trinormal or bistudentt (as shown in the help pages of these functions). More implamented bivariate distributions are given in table 13.1 (page 382) of the book.

Then, first to get the parameters and other coefficients of VGAM models, and second, to distinguish between multiple responses and multivariate responses, one should look at the vglm-object@family. Therefore, the way to get them, and in particular sigma, is not so immediate as https://github.com/easystats/insight/blob/a81c92c59dac4b2c6a212b225a8295ebda8af354/R/get_sigma.R#L96-L101

Let's look some other examples.

Multiple responses and multivariate responses simultaneously (got from ?bistudentt)

nn <- 1000
mydof <- logloglink(1, inverse = TRUE)
ymat <- cbind(rt(nn, df = mydof), rt(nn, df = mydof))
bdata <- data.frame(y1 = ymat[, 1], y2 = ymat[, 2],
                    y3 = ymat[, 1], y4 = ymat[, 2], x2 = runif(nn))
## Not run:  plot(ymat, col = "blue") 
fit1 <- vglm(cbind(y1, y2, y3, y4) ~ 1,  # 2 responses, e.g., (y1,y2) is the 1st
             fam = bistudentt,  # crit = "coef",  # Sometimes a good idea
             data = bdata, trace = FALSE)
coef(fit1, matrix = TRUE)
            logloglink(df1) rhobitlink(rho1) logloglink(df2) rhobitlink(rho2)
(Intercept)        1.072883      -0.01766264        1.072882      -0.01765367

In this case, as we had two responses, we would have two sigma matrices, both having 1 in all the elements of the main diagonal, and having as secondary diagonal the values rho1 (maybe = rhobitlink(-0.01766264, inverse = TRUE)?) and rho2 (maybe = rhobitlink(-0.01765367, inverse = TRUE)?) respectively.

Multiple responses (got from ?tobit)

tdata <- data.frame(x2 = seq(-1, 1, length = (nn <- 100)))
set.seed(1)
Lower <- 1; Upper <- 4  # For the nonstandard Tobit model
tdata <- transform(tdata,
               Lower.vec = rnorm(nn, Lower, 0.5),
               Upper.vec = rnorm(nn, Upper, 0.5))
meanfun1 <- function(x) 0 + 2*x
meanfun2 <- function(x) 2 + 2*x
meanfun3 <- function(x) 2 + 2*x
meanfun4 <- function(x) 3 + 2*x
tdata <- transform(tdata,
y1 = rtobit(nn, mean = meanfun1(x2)),  # Standard Tobit model
y2 = rtobit(nn, mean = meanfun2(x2), Lower = Lower, Upper = Upper),
y3 = rtobit(nn, mean = meanfun3(x2), Lower = Lower.vec, Upper = Upper.vec),
y4 = rtobit(nn, mean = meanfun3(x2), Lower = Lower.vec, Upper = Upper.vec))
fit4 <- vglm(cbind(y3, y4) ~ x2,
         tobit(Lower = rep(with(tdata, Lower.vec), each = 2),
               Upper = rep(with(tdata, Upper.vec), each = 2),
               byrow.arg = TRUE),
         data = tdata, crit = "coeff", trace = FALSE)
coef(fit4, matrix = TRUE)
             mu1 loglink(sd1)      mu2 loglink(sd2)
(Intercept) 1.950757   0.04030946 1.944842  0.008811054
x2          1.899930   0.00000000 2.122347  0.000000000

In this case, we would have two sigma values, one for each response, being coef(fit4, matrix = TRUE)["(Intercept)","loglink(sd1)"] (or also coef(fit4)["(Intercept):2"]) and coef(fit4, matrix = TRUE)["(Intercept)","loglink(sd2)"] (or also coef(fit4)["(Intercept):4"])

A more trivial example (from ?vglm): same as for glm

d.AD <- data.frame(treatment = gl(3, 3),
                     outcome = gl(3, 1, 9),
                     counts = c(18,17,15,20,10,20,25,13,12))
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson,
             data = d.AD)
vglm.D93 <- vglm(counts ~ outcome + treatment, family = poissonff,
             data = d.AD, trace = FALSE)
sigma(glm.D93)
[1] 1.13238
sigma(vglm.D93)
[1] 1.13238

Finally, other arguments given to family may be relevant, like zero, nointercept and other (table 18.6, page 518 in the book), even this may appear summarized in some way by the vglm-object@family object.

iago-pssjd commented 2 years ago

Another thought related to this issue is if sigma should be reported among the "performance" values of a model, when it is being estimated by the model. Probably, reporting it as until now, among the "parameters", is better, but would look even better if we could detect it by means of reading loglink(sd) or sigma (or some other alternative) instead of (Intercept):2.

strengejacke commented 2 years ago

I think this will take some time until I (or someone else) can look into this, seems not to be very straightforward. If you are able to, we of course also appreciate PRs on this issue :-)

easystats / insight

FR/Bug: Calculate Sigma (`get_sigma`) appropiately for `vglm` (and `vgam`) models #578