Question - Is there a way to find variables for smooth components in mgcv::gam?

stefanocoretta commented 2 years ago

Hello! Thanks for this package, it's so great!

I have a question about GAMs with mgcv.

I wonder if there is a function to programmatically find variables based on smooth term strings (without having to regex the string).

For example:

library(mgcv)
set.seed(2) ## simulate some data... 
dat <- gamSim(1,n=400,dist="normal",scale=2)
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)

There are four smooth terms, and I would like to extract the variables in each term, so that, for example, from "s(x0)" I get "x0", and so on. (In principle, regexing would work, but smooth specifications can get so complicated that making sure you really get the variable is a bit of a puzzle.)

Is this possible with insight?

IndrajeetPatil commented 2 years ago

Hi, yes, this is possible!

You can use the following function:

library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2) ## simulate some data...
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)
#> Gu & Wahba 4 term additive model
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)

library(insight)
find_variables(b)
#> $response
#> [1] "y"
#> 
#> $conditional
#> [1] "x0" "x1" "x2" "x3"

Created on 2022-04-15 by the reprex package (v2.0.1)

IndrajeetPatil commented 2 years ago

Have a look at the docs to see additional customizations you can do with it: https://easystats.github.io/insight/reference/find_variables.html

strengejacke commented 2 years ago

And also find_smooth().

strengejacke commented 2 years ago

library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2) ## simulate some data... 
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
b <- gam(y~x0+s(x1)+s(x2)+x3,data=dat)

find_variables(b)
#> $response
#> [1] "y"
#> 
#> $conditional
#> [1] "x0" "x1" "x2" "x3"

find_smooth(b)
#> $smooth_terms
#> [1] "s(x1)" "s(x2)"

find_terms(b)
#> $response
#> [1] "y"
#> 
#> $conditional
#> [1] "x0"    "s(x1)" "s(x2)" "x3"

Created on 2022-04-15 by the reprex package (v2.0.1)

stefanocoretta commented 2 years ago

Hi! I am not sure that answers my question.

What I am trying to achieve is returning the variables inside the smooths after finding the smooths.

Pseudo-code example:

b <- gam(y~x0+s(x1)+s(x2)+x3+s(x1, x3),data=dat)

smooths <- find_smooth(b)
smooths
#> $smooth_terms
#> [1] "s(x1)" "s(x2)" "s(x1, x3)"

find_vars_from_smooth(smooths)
#> $`s(x1)`
#> [1] "x1"
#>
#> $`s(x2)`
#> [1] "x2"
#>
#> $`s(x1, x3)`
#> [1] "x1" "x3"

strengejacke commented 2 years ago

OK, then just use clean_names() on the output of find_smooth():

library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2) ## simulate some data... 
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
b <- gam(y~x0+s(x1)+s(x2)+x3,data=dat)

find_smooth(b, flatten = TRUE) |> clean_names()
#> [1] "x1" "x2"

Created on 2022-04-16 by the reprex package (v2.0.1)

stefanocoretta commented 2 years ago

Unfortunately, it doesn't work correctly:

library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2) ## simulate some data... 
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
b <- gam(y~x0+s(x1)+s(x2)+x3+s(x1,x2),data=dat)

find_smooth(b, flatten = TRUE) |> clean_names()
#> [1] "x1" "x2" "x1"

The third smooth should return c("x1", "x2"). I have not tried it with by, but I assume it would not work correctly either.

etiennebacher commented 2 years ago

@stefanocoretta It should work now:

library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.

set.seed(2)
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
b <- gam(y~x0+s(x1)+s(x2)+x3+s(x1,x2), data=dat)

find_smooth(b, flatten = TRUE) |> clean_names()
#> [1] "x1"     "x2"     "x1, x2"

d <- gam(y~x0+s(x1)+s(x2)+x3+s(x1,x2, k = -1), data=dat)

find_smooth(d, flatten = TRUE) |> clean_names()
#> [1] "x1"     "x2"     "x1, x2"

Created on 2022-06-07 by the reprex package (v2.0.1)

strengejacke commented 2 years ago

I'm not super familiar with smooth terms (I think @DominiqueMakowski started using them some time ago), but when is it important to include a variable? E.g. here, should the last line return #> [1] "x1" "x2" "x1" or #> [1] "x1" "x2" "x1, x2"?

library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.

set.seed(2)
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model

d <- gam(y~x0+s(x1)+s(x2)+x3+s(x1,by = x2, k = -1), data=dat)

find_smooth(d, flatten = TRUE)
#> [1] "s(x1)"                  "s(x2)"                  "s(x1, by = x2, k = -1)"

find_smooth(d, flatten = TRUE) |> clean_names()
#> [1] "x1" "x2" "x1"

Created on 2022-06-07 by the reprex package (v2.0.1)

DominiqueMakowski commented 2 years ago

Hmm, I am not sure what the expected output is in this case. The last line should probably return "x1, x2" or "x1:x2" or something like that.

IndrajeetPatil commented 2 years ago

Looks like none of us are sure about this.

Is there anyone on the team who is an expert in GAMs? If not, we can also outsource this to Twitter, where we do know some GAM experts.

stefanocoretta commented 2 years ago

Hi! It should return all variables in all cases, and each variable should be a separate element.

These are some of the possible scenarios:

s(time)
s(longitude, latitude)
s(longitude, latitude, altitude)
s(time, by = factor)
s(time, duration, by = factor)
s(time, factor, bs = "fs")
s(factor, bs = "re")
s(factor, time, bs = "re")

Each of those should return:

"time"
c("longitude", "latitude")
c("longitude", "latitude", "altitude")
c("time", "factor")
c("time", "duration", "factor")
c("time", "factor")
"factor"
c("factor", "time")

That is the necessary format for the variables to be used in predict.gam().
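
As a rough illustration of the mapping listed above, here is a base-R sketch that does not use insight or mgcv at all. It assumes that every bare symbol inside the s() call is a data variable, so a term such as s(x, k = K) would also return "K"; the helper name vars_from_smooth is made up for this example and is not an insight function.

```r
# Smooth term strings from the scenarios above.
smooth_terms <- c(
  "s(time)",
  "s(longitude, latitude)",
  "s(longitude, latitude, altitude)",
  "s(time, by = factor)",
  "s(time, duration, by = factor)",
  "s(time, factor, bs = \"fs\")",
  "s(factor, bs = \"re\")",
  "s(factor, time, bs = \"re\")"
)

# Hypothetical helper: parse the term into a call and collect its symbols.
# all.vars() skips function names, argument names and string constants,
# so only symbols such as `time` or `factor` remain.
# Assumption: no argument (k, sp, xt, ...) refers to another R object.
vars_from_smooth <- function(term) {
  all.vars(str2lang(term))
}

# Named list: one character vector of variables per smooth term.
smooth_vars <- sapply(smooth_terms, vars_from_smooth, simplify = FALSE)
smooth_vars[["s(time, duration, by = factor)"]]
```

Parsing the term avoids regexes, but it inherits the assumption above, so it is only a starting point rather than a general solution.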

etiennebacher commented 2 years ago

@stefanocoretta since clean_names() returns a character vector, it will only be possible to return e.g "time, factor" and not c("time", "factor"). The only way to return c("time", "factor") would be to change the output format of clean_names() to output a list instead of a character vector, which would break the existing code using clean_names().

etiennebacher commented 2 years ago

@stefanocoretta There's an example of output in #580

stefanocoretta commented 2 years ago

It might do, although it's a bit inelegant, because technically the s() term can have more than one variable, and I would expect clean_names() to return those individually.

To use the output further, I would have to split each element on the comma, which is OK, although a bit of a hack.

But if that means rewriting the code to accept lists, then your current solution will just do! 😄
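
The splitting workaround mentioned above could look roughly like this. The input vector is typed in by hand to mimic the comma-separated output shown earlier in the thread; the names cleaned and split_vars are only illustrative.

```r
# Stand-in for the comma-separated output discussed above,
# named by the smooth term each element came from.
cleaned <- c("x1", "x2", "x1, x2")
names(cleaned) <- c("s(x1)", "s(x2)", "s(x1, x2)")

# Split each element on a comma followed by optional whitespace.
# strsplit() returns a list and keeps the names of its input.
split_vars <- strsplit(cleaned, ",\\s*")
split_vars[["s(x1, x2)"]]
```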

etiennebacher commented 2 years ago

> It might do, although it's a bit inelegant, because technically the s() term can have more than one variable, and I would expect clean_names() to return those individually.

But then there could be duplicates if there are several calls to s() in the formula, right? For example, what is the output you would expect for this?

d <- gam(y~s(x1)+s(x2)+s(x1,by = x2, k = -1), data=dat)
find_smooth(d, flatten = TRUE) |> clean_names()

strengejacke commented 2 years ago

Maybe we could return a character vector from clean_names(), instead of a single comma-separated string. Then it's up to the user to do something like

sapply(insight::find_smooth(d, flatten = TRUE), insight::clean_names, simplify = FALSE)

which will give the information @stefanocoretta requested: a named list (named by smooth term) whose elements are the variables used.

stefanocoretta commented 2 years ago

> > It might do, although it's a bit inelegant, because technically the s() term can have more than one variable, and I would expect clean_names() to return those individually.
>
> But then there could be duplicates if there are several calls to s() in the formula, right? For example, what is the output you would expect for this?
> d <- gam(y~s(x1)+s(x2)+s(x1,by = x2, k = -1), data=dat)
> find_smooth(d, flatten = TRUE) |> clean_names()

Correct, they should be duplicated, because to predict you need to know which smooths involve which variables (especially when excluding terms while predicting). The mgcv implementation of GAMs is a bit different in structure from most other models.

So ideally I would expect: [1] "x1" "x2" "x1, x2". Note that often, when a factor is included as a by-variable, it is also included as a parametric effect. For example:

gam(y ~ fac + s(x) + s(x, by = fac))
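
For the question of which smooths involve which variables, the fitted mgcv object itself may be a useful source. The sketch below reads the smooth list of the fitted model and assumes that each smooth object carries label, term, and by, with by stored as the string "NA" when there is no by-variable. The object names d and smooth_vars are only illustrative, and this is not how insight does it.

```r
library(mgcv)

set.seed(2)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)
d <- gam(y ~ x0 + s(x1) + s(x2) + x3 + s(x1, by = x2), data = dat)

# One entry per smooth: its covariates, plus the by-variable if present.
# Assumption: `by` is the character string "NA" when no by-variable is set.
smooth_vars <- lapply(d$smooth, function(sm) {
  c(sm$term, if (!identical(sm$by, "NA")) sm$by)
})

# Name the entries by the labels mgcv uses for the smooths,
# i.e. the labels that appear in summary(d).
names(smooth_vars) <- vapply(d$smooth, function(sm) sm$label, character(1))
smooth_vars
```

Note that with a factor by-variable mgcv creates one smooth per factor level, so the list would then contain one entry per level rather than one per s() call.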

easystats / insight

Question - Is there a way to find variables for smooth components in mgcv::gam? #553