When you include interactions, they do really well. See the example below with an interaction added. parsnip / workflows handle the s() function. Here's a quick example of mgcv::gam() on time series data with and without interactions. We won't always get results this good, but GAMs can be powerful.
library(mgcv)
library(modeltime)
library(tidymodels)
library(tidyverse)
library(timetk)
library(lubridate)

# Feature engineering: month of year, a numeric date trend, and a 24-month lag
data_prepared_tbl <- m750 %>%
    mutate(
        date_month = month(date),
        date_num   = as.numeric(date),
        lag_24     = lag(value, 24)
    )

# Hold out the last 2 years for testing
splits <- time_series_split(data_prepared_tbl, assess = "2 years", cumulative = TRUE)
# NO INTERACTION ----
model_fit_gam <- gam(
    formula = value ~ s(date_month, k = 12) + s(date_num) + s(lag_24),
    data    = training(splits)
)
model_fit_gam
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> value ~ s(date_month, k = 12) + s(date_num) + s(lag_24)
#>
#> Estimated degrees of freedom:
#> 10.39 8.59 4.52 total = 24.5
#>
#> GCV score: 44025.9
preds_train <- predict(model_fit_gam, newdata = training(splits)) %>% as.numeric()
preds_test  <- predict(model_fit_gam, newdata = testing(splits)) %>% as.numeric()

# Plot actuals vs test-set predictions
bind_rows(
    training(splits),
    testing(splits) %>% add_column(predictions = preds_test)
) %>%
    select(id, date, value, predictions) %>%
    pivot_longer(
        cols      = c(value, predictions),
        names_to  = "name",
        values_to = "value"
    ) %>%
    mutate(name = as_factor(name)) %>%
    plot_time_series(
        date, value, .color_var = name, .smooth = FALSE
    )
# WITH INTERACTION ----
# Add a 2-d smooth interaction between the trend and the month
model_fit_gam <- gam(
    formula = value ~ s(date_month, k = 12) + s(date_num) + s(lag_24) +
        s(date_num, date_month),
    data = training(splits)
)
model_fit_gam
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> value ~ s(date_month, k = 12) + s(date_num) + s(lag_24) + s(date_num,
#> date_month)
#>
#> Estimated degrees of freedom:
#> 10.11 1.00 5.06 22.01 total = 39.18
#>
#> GCV score: 23635.4
preds_train <- predict(model_fit_gam, newdata = training(splits)) %>% as.numeric()
preds_test  <- predict(model_fit_gam, newdata = testing(splits)) %>% as.numeric()

bind_rows(
    training(splits),
    testing(splits) %>% add_column(predictions = preds_test)
) %>%
    select(id, date, value, predictions) %>%
    pivot_longer(
        cols      = c(value, predictions),
        names_to  = "name",
        values_to = "value"
    ) %>%
    mutate(name = as_factor(name)) %>%
    plot_time_series(
        date, value, .color_var = name, .smooth = FALSE
    )
Created on 2021-03-19 by the reprex package (v1.0.0)
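To put a number on the out-of-sample fit, test-set accuracy can be computed with yardstick (a quick sketch, not part of the original reprex; it reuses preds_test from above):

# Sketch: test-set RMSE for the fitted model (assumes preds_test from above)
library(yardstick)

testing(splits) %>%
    add_column(predictions = preds_test) %>%
    rmse(truth = value, estimate = predictions)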
Adding this comment for posterity:
What I like about GAMs also is that they are very different from a tree-based approach, e.g. RF or XGB, so they should be great in an ensemble. And unlike linear regression, as with GLMs, you can assume different distributions for the errors: not just gaussian, but also binomial, poisson, and gamma. For "errors", read "outcomes", i.e. different distributions for the outputs. Since time series are often non-negative in nature, such as financial price series, the symmetrical assumption of gaussian/normal errors is not appropriate.

In my work, we've used GLMs/GAMs a bit for insurance pricing, and it's common to model the frequency / counts of claims using a poisson error distribution, and the severity / size of claims using a gamma error distribution, which is a bit like a lognormal distribution. You also transform predictions, e.g. to a log scale, using a "link" function.

Logistic regression is really just a GLM with a binomial error distribution plus a "squashing function" to convert predictions to a probability, i.e. onto a [0,1] scale. This is called a logistic link function, the one you also see in neural nets. But a GLM is restricted to linear, monotonic relationships. If you want to model non-linearity, e.g. a U-shaped relationship between age and insurance risk, or indeed almost any real-world phenomenon, then logistic GAMs, and GAMs in general with other error distributions such as poisson, gamma, etc., take things up a level. GAMs = "GLMs on steroids", you might say!

Maybe you knew most of this already, but if not, it's nice to return the favour for once since I've learned so much from you!
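To make the frequency / severity idea concrete, here is a minimal sketch on simulated data (the column names and coefficients are hypothetical; this illustrates the pattern, not actual pricing code):

library(mgcv)

set.seed(42)
claims_tbl <- data.frame(age = runif(500, 18, 80))
# Simulate a U-shaped age/risk relationship in the claim counts
claims_tbl$n_claims <- rpois(500, lambda = exp(-1 + 0.001 * (claims_tbl$age - 45)^2))

# Claim frequency: poisson errors with a log link, and a smooth on age
# that can pick up the U-shape a plain GLM would miss
freq_fit <- gam(n_claims ~ s(age), family = poisson(link = "log"), data = claims_tbl)
summary(freq_fit)

# Claim severity would use family = Gamma(link = "log") on the claim amounts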
Hi @mdancho84,
I have tried creating this model to see how it affects the s() function and have not encountered any problems, at least when using the "formula" parsnip interface. I attach a proof of concept where the results can be seen.
Regards,
Very nice!! 👍
Awesome Alberto! Thanks for working on this.
Note: if you replace a smooth s() with a tensor ti() for the interaction, you might get better performance.
While s() is good for variables on the same scale, e.g. geospatial (x, y) coordinates, ti() has an additional smoothing parameter, which helps when the variables involved in the interaction are on different scales.
It is also recommended by experts like Noam Ross to use method = "REML" instead of the default GCV criterion when fitting with gam().
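Applied to the reprex above, both suggestions together would look something like this (a sketch, untested on this data):

# Tensor-product interaction via ti() plus REML smoothness selection
model_fit_gam_ti <- gam(
    formula = value ~ s(date_month, k = 12) + s(date_num) + s(lag_24) +
        ti(date_num, date_month),
    data   = training(splits),
    method = "REML"
)

Note that ti() models the pure interaction and excludes the main effects, so it pairs naturally with the separate s() terms already in the formula.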
Hi @BenWynne-Morris,
If you want to replace a smooth s() with a tensor ti(), it is enough to modify the formula passed to parsnip::fit() and give ti() the parameters you want. You can also change the fitting method through parsnip::set_engine() (along with all the other parameters of mgcv::gam()), as I show in the attached image.
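For example, something along these lines (a hypothetical sketch: gen_additive_mod() is the spec name that released parsnip eventually settled on, which may differ from the proof-of-concept function discussed here):

library(parsnip)

gam_spec <- gen_additive_mod() %>%
    set_engine("mgcv", method = "REML") %>%   # engine args are passed to mgcv::gam()
    set_mode("regression")

gam_fit <- gam_spec %>%
    fit(
        value ~ s(date_month, k = 12) + s(date_num) + s(lag_24) +
            ti(date_num, date_month),
        data = training(splits)
    )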
Greetings,
PS: Are you working on this? To be honest, I just wrote this function out of curiosity to see how it affected the s() issue. I don't know what the modeling plan is, but if you want me to help you with this, no problem.
This looks great. What's everyone's thoughts? I'm thinking either a modeltime.gam package that covers just what we need for regression, or a gamsnip R package. The latter would cover regression, classification, and multi-class, and be useful for a broader set of business problems beyond just forecasting.
Hi,
IMHO, the first option doesn't convince me because it risks accumulating too many packages in the modeltime ecosystem (in that case, wouldn't it be better to include it directly in modeltime together with the rest of the algorithms?). I can understand package size concerns, but I think an additional package should be reserved for something more general. Too many overly specific packages can make it hard for people to keep track of which packages exist (but this is just my opinion).
So maybe the gamsnip option is the best. But whatever you say :)
No strong opinion.
OK, I'm leaning towards a new R package that covers GAM models and the mgcv package. We OK with gamsnip? If so, I'll create a repo for it.
I’m OK ✅✅
Let’s go with gamsnip
Alright. I will set it up tomorrow.
Hi @AlbertoAlmuinha,
For an interactive intro to the theory, you might find this reference helpful: https://m-clark.github.io/generalized-additive-models/case_for_gam.html
I recently took the course provided by Noam Ross (https://noamross.github.io/gams-in-r-course/), which is really amazing.
I'm relatively new to GAMs and package development but will follow along and try and help out where I can, e.g. testing, etc. If you have any questions about GAMs, I'll attempt to fill in the gaps. Note that once you have fitted a gam model object, you can use summary(), plot() and predict() just like with a fitted lm model object. The broom package functions such as tidy(), glance() and augment() also work with fitted GAM objects.
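For example, using the model_fit_gam object from the reprex above (a quick sketch):

summary(model_fit_gam)           # smooth-term significance, R-sq, GCV/REML score
plot(model_fit_gam, pages = 1)   # partial-effect plot for each smooth term
head(predict(model_fit_gam, newdata = testing(splits)))

library(broom)
tidy(model_fit_gam)      # one row per smooth term
glance(model_fit_gam)    # one-row model summary (df, AIC, deviance, ...)
augment(model_fit_gam)   # data plus fitted values and residuals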
Very basic theory: GLMs are a generalisation of linear regression, with two key differences: (1) errors don't have to be normally distributed, e.g. you could have binomial / poisson / gamma errors instead; (2) linear predictions can be transformed using another function.
Linear regression uses (1) gaussian errors and (2) an "identity" transformation function. Logistic regression is also a special case of a GLM, with (1) binomial errors and (2) a squashing function to transform predictions to a [0,1] range, so they can be interpreted as probabilities.
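In R terms, those two choices are the family and the link. A tiny illustration on simulated count data:

set.seed(123)
x <- runif(100)
y <- rpois(100, lambda = exp(1 + 2 * x))

# Poisson errors with a log link: the linear predictor lives on the
# log scale and is mapped back through the inverse link
glm(y ~ x, family = poisson(link = "log"))

# Logistic regression is the same machinery with binomial errors and
# a logit ("squashing") link, e.g. glm(z ~ x, family = binomial(link = "logit"))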
GAMs are a generalisation of GLMs. The s() function wrapped around a variable name means the algorithm has the flexibility to find a non-linear relationship (e.g. a U-shaped or otherwise non-monotonic relationship) instead of a linear relationship between predictors and the target. With two variables inside s(), you're specifying a 2-d smooth interaction. It's like adding an interaction in linear regression.
The following graph is an example of how the thing we're predicting varies with a 2-d smooth, with confidence intervals added to show the uncertainty of the estimated 2-d smooth surface:
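A surface like that can be produced directly from a fitted model with mgcv::vis.gam() (a sketch using the interaction model fitted earlier):

# Plot the fitted 2-d smooth over date_num and date_month, with
# +/- 2 standard-error surfaces to show the uncertainty
vis.gam(
    model_fit_gam,
    view  = c("date_num", "date_month"),
    se    = 2,
    theta = 30
)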
Hey @AlbertoAlmuinha and @BenWynne-Morris, I've moved this project to business-science/gamsnip#1.
Thank you very much @BenWynne-Morris for the explanation. I found the first link really useful, and I will definitely take the DataCamp course. I was familiar with GLMs but hadn't really heard about GAMs. Your message is really valuable!
Can we close this issue too?
Yes, covered in parsnip now.
Looking for helpers in developing a package related to modeltime.gam. May not need to be a new package, but modeltime is getting quite large.