easystats / insight

:crystal_ball: Easy access to model information for various model objects
https://easystats.github.io/insight/
GNU General Public License v3.0
404 stars 39 forks source link

find_formula ignores offset when written after random terms #79

Closed strengejacke closed 4 years ago

strengejacke commented 5 years ago

This is an issue in insight:

library(insight)
library(glmmTMB)
data("Salamanders")

m1 <- glmmTMB(
  count ~ mined + (1 | site) + offset(Wtemp), 
  zi =  ~ 1, 
  family = nbinom2, 
  data = Salamanders
)

m2 <- glmmTMB(
  count ~ mined + offset(Wtemp) + (1 | site), 
  zi =  ~ 1, 
  family = nbinom2, 
  data = Salamanders
)

find_formula(m1)
#> $conditional
#> count ~ mined
#> 
#> $random
#> ~1 | site

find_formula(m2)
#> $conditional
#> count ~ mined + offset(Wtemp)
#> 
#> $random
#> ~1 | site

As my regular expression knowledge is too limited to currently find a solution, I'd suggest you just re-write your formula and put the offset-term before the random term(s).

Originally posted by @strengejacke in https://github.com/strengejacke/ggeffects/issues/72#issuecomment-484003127

DominiqueMakowski commented 5 years ago

That's an interesting edge case!

strengejacke commented 5 years ago

The related regexp might be this one in the helper-function .get_fixed_effects(): https://github.com/easystats/insight/blob/442ad486a723e0b942eb27bf8d18efc1485033c8/R/helper_functions.R#L98

bbolker commented 5 years ago

this might be a lot easier using lme4::nobars() ...

strengejacke commented 5 years ago

I'm not sure, I think we have implemented it this way to make it work with other packages like lme or GLMMadaptive as well.

strengejacke commented 5 years ago

Or maybe to reduce package dependencies.

strengejacke commented 5 years ago

Or maybe to work with panel models, like plm, feist or panelr. There has been a reason to generalize this function.

bbolker commented 5 years ago

OK. My only point was that working with the language object, as much of a pain as it is, is often more robust than deparsing and using regexes ...

strengejacke commented 5 years ago

Ok, I can narrow down this a bit. .get_fixed_effects() is called from insight::find_formula() for all model-objects that can have random effects. This includes packages like lme4, glmmTMN, GLMMadaptive, brms, rstanarm, afex, coxme, ordinal, and some more.

Now there were two reasons for finding an own solution: 1) insight, as low-level-package, should only depend on packages that ship with R (base, utils, stats, ...) 2) the function should also work for "complicated" formulas, especially for brms, there are some edge cases we needed to capture.

here are two examples (ignore the response-part), the second one giving a non-desired result.

lme4::nobars(as.formula("time | cens(censored) ~ age * sex + disease + (1|patient)"))
#> age * sex + disease
insight:::.get_fixed_effects(as.formula("time | cens(censored) ~ age * sex + disease + (1|patient)"))
#> [1] "time | cens(censored) ~ age * sex + disease"

lme4::nobars(as.formula("success | trials(ntrials) ~ x + (1 | patient)"))
#> x ~ 1
insight:::.get_fixed_effects(as.formula("success | trials(ntrials) ~ x + (1 | patient)"))
#> [1] "success | trials(ntrials) ~ x"

Created on 2019-07-01 by the reprex package (v0.3.0)

But maybe it is indeed better to find a solution based on the language-object itself instead of using regex for the parsed formula-string. I'll look into this.

strengejacke commented 4 years ago

I think I figured out the required regex pattern: "(\\+|:|\\*|/)(?![^\\(]*\\))"

I checked various examples, all extract the correct fixed effects terms and the random effects, so I can remove the RE terms from the formula:

pattern <- "(\\+|:|\\*|/)(?![^\\(]*\\))"

p <- "am ~ disp:wt + (1|gear) + wt + (1+wt|carb)"
p <- deparse(as.formula(p)[[3]])
x <- lapply(strsplit(p, pattern, perl = TRUE), trimws)
unlist(x)[!grepl("\\((.*)\\|(.*)\\)", unlist(x))]
#> [1] "disp" "wt"   "wt"

p <- "am ~ disp:wt + wt + (1|gear) + (1+wt|carb)"
p <- deparse(as.formula(p)[[3]])
x <- lapply(strsplit(p, pattern, perl = TRUE), trimws)
unlist(x)[!grepl("\\((.*)\\|(.*)\\)", unlist(x))]
#> [1] "disp" "wt"   "wt"

p <- "am ~ (1|gear) + (1+wt|carb) + disp:wt + wt"
p <- deparse(as.formula(p)[[3]])
x <- lapply(strsplit(p, pattern, perl = TRUE), trimws)
unlist(x)[!grepl("\\((.*)\\|(.*)\\)", unlist(x))]
#> [1] "disp" "wt"   "wt"

p <- "am ~ 1 + (1|gear)"
p <- deparse(as.formula(p)[[3]])
x <- lapply(strsplit(p, pattern, perl = TRUE), trimws)
unlist(x)[!grepl("\\((.*)\\|(.*)\\)", unlist(x))]
#> [1] "1"

p <- "am ~ disp:wt * (1|gear) + wt + (1+wt|carb)"
p <- deparse(as.formula(p)[[3]])
x <- lapply(strsplit(p, pattern, perl = TRUE), trimws)
unlist(x)[!grepl("\\((.*)\\|(.*)\\)", unlist(x))]
#> [1] "disp" "wt"   "wt"

p <- "am ~ disp:wt * (1|gear) + wt + (1*wt|carb)"
p <- deparse(as.formula(p)[[3]])
x <- lapply(strsplit(p, pattern, perl = TRUE), trimws)
unlist(x)[!grepl("\\((.*)\\|(.*)\\)", unlist(x))]
#> [1] "disp" "wt"   "wt"

p <- "am ~ disp:wt * wt + (1|gear) + (1+wt|carb)"
p <- deparse(as.formula(p)[[3]])
x <- lapply(strsplit(p, pattern, perl = TRUE), trimws)
unlist(x)[!grepl("\\((.*)\\|(.*)\\)", unlist(x))]
#> [1] "disp" "wt"   "wt"

p <- "am ~ (1|gear) + (1+wt|carb) + disp:wt * wt"
p <- deparse(as.formula(p)[[3]])
x <- lapply(strsplit(p, pattern, perl = TRUE), trimws)
unlist(x)[!grepl("\\((.*)\\|(.*)\\)", unlist(x))]
#> [1] "disp" "wt"   "wt"

p <- "am ~ 1 + (1|gear)"
p <- deparse(as.formula(p)[[3]])
x <- lapply(strsplit(p, pattern, perl = TRUE), trimws)
unlist(x)[!grepl("\\((.*)\\|(.*)\\)", unlist(x))]
#> [1] "1"

p <- "am ~ disp:wt + poly(gear, 2) + wt + (1+wt|carb)"
p <- deparse(as.formula(p)[[3]])
x <- lapply(strsplit(p, pattern, perl = TRUE), trimws)
unlist(x)[!grepl("\\((.*)\\|(.*)\\)", unlist(x))]
#> [1] "disp"          "wt"            "poly(gear, 2)" "wt"

p <- "y ~ post + time1 + (1 | g2 / g1 / g0) + (post + time1 - 1 | g2)"
p <- deparse(as.formula(p)[[3]])
x <- lapply(strsplit(p, pattern, perl = TRUE), trimws)
unlist(x)[!grepl("\\((.*)\\|(.*)\\)", unlist(x))]
#> [1] "post"  "time1"

p <- "count ~ mined + (1 | site) + offset(Wtemp)"
p <- deparse(as.formula(p)[[3]])
x <- lapply(strsplit(p, pattern, perl = TRUE), trimws)
unlist(x)[!grepl("\\((.*)\\|(.*)\\)", unlist(x))]
#> [1] "mined"         "offset(Wtemp)"

p <- "count ~ mined + offset(Wtemp) + (1 | site)"
p <- deparse(as.formula(p)[[3]])
x <- lapply(strsplit(p, pattern, perl = TRUE), trimws)
unlist(x)[!grepl("\\((.*)\\|(.*)\\)", unlist(x))]
#> [1] "mined"         "offset(Wtemp)"

p <- "time | cens(censored) ~ age * sex + disease + (1|patient)"
p <- deparse(as.formula(p)[[3]])
x <- lapply(strsplit(p, pattern, perl = TRUE), trimws)
unlist(x)[!grepl("\\((.*)\\|(.*)\\)", unlist(x))]
#> [1] "age"     "sex"     "disease"

p <- "success | trials(ntrials) ~ x + (1 | patient)"
p <- deparse(as.formula(p)[[3]])
x <- lapply(strsplit(p, pattern, perl = TRUE), trimws)
unlist(x)[!grepl("\\((.*)\\|(.*)\\)", unlist(x))]
#> [1] "x"

Created on 2020-07-12 by the reprex package (v0.3.0)

bbolker commented 4 years ago

I certainly haven't tried this on all your new examples, but simply restricting lme4::nobars() to apply only to the right-hand side of the formula resolves the original examples where nobars gave an undesired/incorrect answer.

As for self-containedness, lme4::nobars (and the associated hidden utility lme4::nobars_()) could easily be copied ...

f1 <- as.formula("time | cens(censored) ~ age * sex + disease + (1|patient)")
f2 <- as.formula("success | trials(ntrials) ~ x + (1 | patient)")

my_nb <- function(x) {
    x[[3]] <- lme4::nobars(x[[3]])
    return(x)
}

lme4::nobars(f1)
## age * sex + disease
insight:::.get_fixed_effects(f1)
## "time | cens(censored) ~ age * sex + disease"
my_nb(f1)

lme4::nobars(f2)
## x ~ 1
insight:::.get_fixed_effects(f1)
## [1] "success | trials(ntrials) ~ x"
my_nb(f1)
## time | cens(censored) ~ age * sex + disease

PS you might want to use the new deparse1 function to protect yourself against the situation where users decide to include reeeeeeeeeeeeeeeeeeeeeeeeeeaaaaaaalllllllllllly long variable names or formulas ...

strengejacke commented 4 years ago

Indeed, using lme4::nobars() only on the RHS of the formula would solve the issue, I think. The reprex (with some additional modifications) and nobars() both seem to work fine on all examples (see below).

@ deparse1: yes, we are using an internal .safe_deparse() function, that literally does the same:

https://github.com/easystats/insight/blob/master/R/helper_functions.R#L472-L477

I'm aware of the bnew deparse1(), however, due to back compatibility with older R versions, I would stick to our own solution.

mynb <- function(x) {
  x <- as.formula(x)
  x[[3]] <- lme4::nobars(x[[3]])
  x
}

mynb2 <- function(x) {
  x <- as.formula(x)

  # only RHS of formula, and regex pattern
  fe <- deparse(x[[3]])
  pattern <- "(\\+|:|\\*|/)(?![^\\(]*\\))"

  # trim at operators
  parts <- lapply(strsplit(fe, pattern, perl = TRUE), trimws)

  # extract RE parts of formula
  re <- unlist(parts)[grepl("\\((.*)\\|(.*)\\)", unlist(parts))]

  # remove each RE part
  for (i in re) {
    fe <- gsub(i, "", fe, fixed = TRUE)
  }

  # remove duplicated operator signs. When we have "x1 + (1|grp) + x2",
  # after removing we have "x1 + + x2".
  fe <- trimws(gsub("(\\+)(\\s*)(\\+)", "+", fe))

  # we might have "+" at start or end, "+ x1 + x2" or "x1 + x2 +"
  fe <- trimws(gsub("(\\s*)\\+$", "", fe))
  fe <- trimws(gsub("^\\+", "", fe))

  # re-build formula
  as.formula(paste0(deparse(x[[2]]), " ~ ", fe))
}

p <- "am ~ disp:wt + (1|gear) + wt + (1+wt|carb)"
mynb(p)
#> am ~ disp:wt + wt
#> <environment: 0x0000000015924d50>
mynb2(p)
#> am ~ disp:wt + wt
#> <environment: 0x000000001b5064f0>

p <- "am ~ disp:wt + wt + (1|gear) + (1+wt|carb)"
mynb(p)
#> am ~ disp:wt + wt
#> <environment: 0x00000000195df7c0>
mynb2(p)
#> am ~ disp:wt + wt
#> <environment: 0x000000001910b410>

p <- "am ~ (1|gear) + (1+wt|carb) + disp:wt + wt"
mynb(p)
#> am ~ disp:wt + wt
#> <environment: 0x0000000019030d18>
mynb2(p)
#> am ~ disp:wt + wt
#> <environment: 0x0000000018fb41c0>

p <- "am ~ 1 + (1|gear)"
mynb(p)
#> am ~ 1
#> <environment: 0x000000001388c230>
mynb2(p)
#> am ~ 1
#> <environment: 0x000000001be28800>

p <- "am ~ disp:wt * (1|gear) + wt + (1+wt|carb)"
mynb(p)
#> am ~ disp:wt + wt
#> <environment: 0x000000001befab48>
mynb2(p)
#> am ~ disp:wt * +wt
#> <environment: 0x000000001a717530>

p <- "am ~ disp:wt * (1|gear) + wt + (1*wt|carb)"
mynb(p)
#> am ~ disp:wt + wt
#> <environment: 0x000000001c1626e8>
mynb2(p)
#> am ~ disp:wt * +wt
#> <environment: 0x000000001c1c8ee8>

p <- "am ~ disp:wt * wt + (1|gear) + (1+wt|carb)"
mynb(p)
#> am ~ disp:wt * wt
#> <environment: 0x000000001c2845b0>
mynb2(p)
#> am ~ disp:wt * wt
#> <environment: 0x000000001c2ef1c8>

p <- "am ~ (1|gear) + (1+wt|carb) + disp:wt * wt"
mynb(p)
#> am ~ disp:wt * wt
#> <environment: 0x000000001c3af230>
mynb2(p)
#> am ~ disp:wt * wt
#> <environment: 0x000000001c417918>

p <- "am ~ 1 + (1|gear)"
mynb(p)
#> am ~ 1
#> <environment: 0x000000001c4ec4a8>
mynb2(p)
#> am ~ 1
#> <environment: 0x000000001c547db8>

p <- "am ~ disp:wt + poly(gear, 2) + wt + (1+wt|carb)"
mynb(p)
#> am ~ disp:wt + poly(gear, 2) + wt
#> <environment: 0x000000001c646e60>
mynb2(p)
#> am ~ disp:wt + poly(gear, 2) + wt
#> <environment: 0x000000001c6af698>

p <- "y ~ post + time1 + (1 | g2 / g1 / g0) + (post + time1 - 1 | g2)"
mynb(p)
#> y ~ post + time1
#> <environment: 0x000000001c784180>
mynb2(p)
#> y ~ post + time1
#> <environment: 0x000000001c9a8580>

p <- "count ~ mined + (1 | site) + offset(Wtemp)"
mynb(p)
#> count ~ mined + offset(Wtemp)
#> <environment: 0x000000001cb03728>
mynb2(p)
#> count ~ mined + offset(Wtemp)
#> <environment: 0x000000001cb69d90>

p <- "count ~ mined + offset(Wtemp) + (1 | site)"
mynb(p)
#> count ~ mined + offset(Wtemp)
#> <environment: 0x000000001cc24060>
mynb2(p)
#> count ~ mined + offset(Wtemp)
#> <environment: 0x000000001cc87040>

p <- "time | cens(censored) ~ age * sex + disease + (1|patient)"
mynb(p)
#> time | cens(censored) ~ age * sex + disease
#> <environment: 0x000000001cd41310>
mynb2(p)
#> time | cens(censored) ~ age * sex + disease
#> <environment: 0x000000001cda5cc0>

p <- "success | trials(ntrials) ~ x + (1 | patient)"
mynb(p)
#> success | trials(ntrials) ~ x
#> <environment: 0x000000001ce64348>
mynb2(p)
#> success | trials(ntrials) ~ x
#> <environment: 0x000000001cec36a8>

Created on 2020-07-12 by the reprex package (v0.3.0)

strengejacke commented 4 years ago

I think the regex pattern still fails for an edge-case (parenthesis inside RE parenthesis):

m <- lmer(
  log(Reaction) ~ Days + I(Days^2) + (1 + Days + exp(Days) | Subject),
  data = sleepstudy
)

while nobars() on the RHS still works fine. Furthermore, benchmarks using microbenchmark showed that the fastest regex-implementation I found was still twice as long as calling nobars() (though we're talking of nano-seconds here). So finally I'm indeed using nobars.

Strange enough that it took me one year to realize that the solution is simply restricting nobars() to the RHS of the formula... 😖 🙈