Open stefanocoretta opened 2 years ago
Hi, yes, this is possible!
You can use the following function:
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2) ## simulate some data...
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)
#> Gu & Wahba 4 term additive model
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)
library(insight)
find_variables(b)
#> $response
#> [1] "y"
#>
#> $conditional
#> [1] "x0" "x1" "x2" "x3"
Created on 2022-04-15 by the reprex package (v2.0.1)
Have a look at the docs to see additional customizations you can do with it: https://easystats.github.io/insight/reference/find_variables.html
And also find_smooth().
library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2) ## simulate some data...
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
b <- gam(y~x0+s(x1)+s(x2)+x3,data=dat)
find_variables(b)
#> $response
#> [1] "y"
#>
#> $conditional
#> [1] "x0" "x1" "x2" "x3"
find_smooth(b)
#> $smooth_terms
#> [1] "s(x1)" "s(x2)"
find_terms(b)
#> $response
#> [1] "y"
#>
#> $conditional
#> [1] "x0" "s(x1)" "s(x2)" "x3"
Created on 2022-04-15 by the reprex package (v2.0.1)
Hi! I am not sure that answers my question.
What I am trying to achieve is returning the variables inside the smooths after finding the smooths.
Pseudo-code example:
b <- gam(y~x0+s(x1)+s(x2)+x3+s(x1, x3),data=dat)
smooths <- find_smooth(b)
smooths
#> $smooth_terms
#> [1] "s(x1)" "s(x2)" "s(x1, x3)"
find_vars_from_smooth(smooths)
#> $`s(x1)`
#> [1] "x1"
#>
#> $`s(x2)`
#> [1] "x2"
#>
#> $`s(x1, x3)`
#> [1] "x1" "x3"
OK, then just use clean_names() on the output of find_smooth():
library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2) ## simulate some data...
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
b <- gam(y~x0+s(x1)+s(x2)+x3,data=dat)
find_smooth(b, flatten = TRUE) |> clean_names()
#> [1] "x1" "x2"
Created on 2022-04-16 by the reprex package (v2.0.1)
Unfortunately, it doesn't work correctly:
library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2) ## simulate some data...
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
b <- gam(y~x0+s(x1)+s(x2)+x3+s(x1,x2),data=dat)
find_smooth(b, flatten = TRUE) |> clean_names()
#> [1] "x1" "x2" "x1"
The third smooth should return c("x1", "x2"). I have not tried with by, but I assume it would not work correctly either.
@stefanocoretta It should work now:
library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2)
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
b <- gam(y~x0+s(x1)+s(x2)+x3+s(x1,x2), data=dat)
find_smooth(b, flatten = TRUE) |> clean_names()
#> [1] "x1" "x2" "x1, x2"
d <- gam(y~x0+s(x1)+s(x2)+x3+s(x1,x2, k = -1), data=dat)
find_smooth(d, flatten = TRUE) |> clean_names()
#> [1] "x1" "x2" "x1, x2"
Created on 2022-06-07 by the reprex package (v2.0.1)
I'm not super familiar with smooth terms (I think @DominiqueMakowski started using them some time ago), but when is it important to include a variable? E.g. here, should the last line return #> [1] "x1" "x2" "x1" or #> [1] "x1" "x2" "x1, x2"?
library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2)
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
d <- gam(y~x0+s(x1)+s(x2)+x3+s(x1,by = x2, k = -1), data=dat)
find_smooth(d, flatten = TRUE)
#> [1] "s(x1)" "s(x2)" "s(x1, by = x2, k = -1)"
find_smooth(d, flatten = TRUE) |> clean_names()
#> [1] "x1" "x2" "x1"
Created on 2022-06-07 by the reprex package (v2.0.1)
Hmm, I am not sure what the expected output is in this case. The last line should probably return "x1, x2" or "x1:x2" or something like that.
Looks like none of us are sure about this.
Is there anyone in the team who is expert in GAMs? If not, we can also outsource this to Twitter, where we do know some GAM experts.
Hi! It should return all variables in all cases. And the variables should be different elements.
These are some of the possible scenarios:
s(time)
s(longitude, latitude)
s(longitude, latitude, altitude)
s(time, by = factor)
s(time, duration, by = factor)
s(time, factor, bs = "fs")
s(factor, bs = "re")
s(factor, time, bs = "re")
Each of those should return:
"time"
c("longitude", "latitude")
c("longitude", "latitude", "altitude")
c("time", "factor")
c("time", "duration", "factor")
c("time", "factor")
"factor"
c("factor", "time")
That is the necessary format for the variables to be used in predict.gam().
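The mapping listed above can be sketched in base R without regular expressions by letting R's own parser read the smooth term. This is only an illustration: smooth_vars() is a hypothetical helper, not part of insight or mgcv, and it assumes that variables appear either as unnamed arguments or in by, while every other named argument (bs, k, and so on) is a setting.

```r
# Hypothetical helper (not an insight or mgcv function): parse a smooth term
# string and return the variables it uses as separate elements.
smooth_vars <- function(term) {
  cl <- str2lang(term)                 # "s(x1, by = x2)" -> call object
  args <- as.list(cl)[-1]              # drop the function name, keep arguments
  nms <- names(args)
  if (is.null(nms)) nms <- rep("", length(args))
  # Assumption: variables are the unnamed arguments plus `by`.
  keep <- nms == "" | nms == "by"
  unique(unlist(lapply(args[keep], all.vars), use.names = FALSE))
}

terms <- c("s(time)", "s(longitude, latitude)", "s(time, by = factor)",
           "s(time, factor, bs = \"fs\")", "s(factor, time, bs = \"re\")")
# Named list: one element per smooth term, holding its variables.
vars <- sapply(terms, smooth_vars, simplify = FALSE)
str(vars)
```

Because the result is a list named by smooth term, a variable that is shared between smooths stays attached to each smooth it belongs to.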
@stefanocoretta since clean_names() returns a character vector, it will only be possible to return e.g. "time, factor" and not c("time", "factor"). The only way to return c("time", "factor") would be to change clean_names() to output a list instead of a character vector, which would break existing code that uses clean_names().
@stefanocoretta There's an example of output in #580
It might do, although it's a bit inelegant because technically the s() term can have more than one variable, and I would expect clean_names() to return those individually.
In order to use the output further I would have to split it on ",", which is OK, although a bit of a hack.
But if that means rewriting the code to accept lists, then your current solution will just do! 😄
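The splitting step described above could look like the following sketch. The two input vectors are typed in by hand (the cleaned names follow the reprex output earlier in the thread, the smooth labels are assumed), so the example runs without fitting a model.

```r
# Smooth terms (assumed labels) and cleaned names as printed earlier.
smooths <- c("s(x1)", "s(x2)", "s(x1, x2)")
cleaned <- c("x1", "x2", "x1, x2")

# Split each comma-separated element and name the pieces by smooth term.
vars <- setNames(strsplit(cleaned, ",\\s*"), smooths)
vars[["s(x1, x2)"]]
```

The result is a named list, so each smooth keeps its own set of variables even when the same variable appears in several smooths.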
> It might do, although it's a bit inelegant because technically the s() term can have more than one variable, and I would expect clean_names() to return those individually.

But then there could be duplicates if there are several calls to s() in the formula, right? For example, what is the output you would expect for this?
d <- gam(y~s(x1)+s(x2)+s(x1,by = x2, k = -1), data=dat)
find_smooth(d, flatten = TRUE) |> clean_names()
Maybe we could return a character vector in clean_names() instead of a comma-separated character element. Then it's up to the user to do something like
sapply(insight::find_smooth(d, flatten = TRUE), insight::clean_names, simplify = FALSE)
which will give the information @stefanocoretta requested: a named list (named by smooth term) whose elements are the variables used.
> > It might do, although it's a bit inelegant because technically the s() term can have more than one variable, and I would expect clean_names() to return those individually.
>
> But then there could be duplicates if there are several calls to s() in the formula, right? For example, what is the output you would expect for this?
>
> d <- gam(y~s(x1)+s(x2)+s(x1,by = x2, k = -1), data=dat)
> find_smooth(d, flatten = TRUE) |> clean_names()
Correct, they should be duplicated, because to predict you need to know which smooths contain which variables (especially when excluding terms while predicting). The mgcv implementation of GAMs is structured a bit differently from most other models.
So ideally I would expect: [1] "x1" "x2" "x1, x2". Note that often, when a factor is included as a by-variable, it is also included as a parametric effect. For example:
gam(y ~ fac + s(x) + s(x, by = fac))
Hello! Thanks for this package, it's so great!
I have a question about GAMs with mgcv.
I wonder if there is a function to programmatically find variables based on smooth term strings (without having to regex the string).
For example:
There are four smooth terms and I would like to be able to extract the variables in the terms, so that for example from "s(x0)" I get "x0" and so on (in principle regexing would work, but smooth specifications can get so complicated that it's a bit of a puzzle to make sure you really get the variable).
Is this possible with insight?