business-science / modeltime

Modeltime unlocks time series forecast models and machine learning in one framework
https://business-science.github.io/modeltime/

GAMs - mgcv modeltime integration #71

Closed mdancho84 closed 3 years ago

mdancho84 commented 3 years ago

Looking for helpers in developing a package related to modeltime.gam. May not need to be a new package but modeltime is getting quite large.

References:

mdancho84 commented 3 years ago

Why We Like GAMs

When you include interactions, they do really well. See example below with interaction added.

Challenges

Example - No Interaction vs Interaction

A quick example of mgcv::gam() on time series data with and without interactions. We won't always get results this good, but GAMs can be powerful.

library(mgcv)
library(modeltime)
library(tidymodels)
library(tidyverse)
library(timetk)
library(lubridate)

data_prepared_tbl <- m750 %>%
    mutate(
        date_month = month(date),
        date_num   = as.numeric(date),
        lag_24     = lag(value, 24)
    ) 

splits <- time_series_split(data_prepared_tbl, assess = "2 years", cumulative = TRUE)

# NO INTERACTION ----

model_fit_gam <- gam(
    formula = value ~ s(date_month, k = 12) + s(date_num) + s(lag_24), 
    data    = training(splits)
)
model_fit_gam
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> value ~ s(date_month, k = 12) + s(date_num) + s(lag_24)
#> 
#> Estimated degrees of freedom:
#> 10.39  8.59  4.52  total = 24.5 
#> 
#> GCV score: 44025.9

preds_train <- predict(model_fit_gam, newdata = training(splits)) %>% as.numeric()
preds_test  <- predict(model_fit_gam, newdata = testing(splits)) %>% as.numeric()

bind_rows(
    training(splits),
    testing(splits) %>% add_column(predictions = preds_test)
) %>%
    select(id, date, value, predictions) %>%
    pivot_longer(
        cols = c(value, predictions),
        names_to  = "name",
        values_to = "value"
    ) %>%
    mutate(name = as_factor(name)) %>%
    plot_time_series(
        date, value, .color_var = name, .smooth = FALSE
    )

# WITH INTERACTION ----

model_fit_gam <- gam(
    formula = value ~ s(date_month, k = 12) + s(date_num) + s(lag_24) + s(date_num, date_month), 
    data    = training(splits)
)
model_fit_gam
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> value ~ s(date_month, k = 12) + s(date_num) + s(lag_24) + s(date_num, 
#>     date_month)
#> 
#> Estimated degrees of freedom:
#> 10.11  1.00  5.06 22.01  total = 39.18 
#> 
#> GCV score: 23635.4

preds_train <- predict(model_fit_gam, newdata = training(splits)) %>% as.numeric()
preds_test  <- predict(model_fit_gam, newdata = testing(splits)) %>% as.numeric()

bind_rows(
    training(splits),
    testing(splits) %>% add_column(predictions = preds_test)
) %>%
    select(id, date, value, predictions) %>%
    pivot_longer(
        cols = c(value, predictions),
        names_to  = "name",
        values_to = "value"
    ) %>%
    mutate(name = as_factor(name)) %>%
    plot_time_series(
        date, value, .color_var = name, .smooth = FALSE
    )

Created on 2021-03-19 by the reprex package (v1.0.0)

mdancho84 commented 3 years ago

Adding this comment for posterity:

What I also like about GAMs is that they are very different from a tree-based approach (e.g. RF, XGB), so they should be great in an ensemble. And unlike linear regression, GAMs, like GLMs, let you assume different distributions for the errors: not just gaussian, but also binomial, poisson, and gamma. For "errors", read "outcomes", i.e. different distributions for the outputs. Since time series are often non-negative in nature, such as financial price series, the symmetrical assumption of gaussian/normal errors is not appropriate.

In my work, we've used GLMs/GAMs a bit for insurance pricing. It's common to model the frequency (counts) of claims using a poisson error distribution, and the severity (size) of claims using a gamma error distribution, which is a bit like a lognormal distribution. You also transform predictions to a log scale using a "link" function.

Logistic regression is really just a GLM with a binomial error distribution plus a "squashing function" to convert predictions to a probability, i.e. onto a [0,1] scale. That's the logistic link function, the one you also see in neural nets. But a GLM is restricted to linear, monotonic relationships. If you want to model non-linearity, e.g. a U-shaped relationship between age and insurance risk, or indeed almost any real-world phenomenon, then logistic GAMs, and GAMs in general with other error distributions such as poisson, gamma, etc., take things up a level. GAMs = "GLMs on steroids", you might say!

Maybe you knew most of this already, but if not, then it's nice to return the favour for once since I've learned so much from you!
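To make the error-distribution point concrete, here is a minimal sketch (with simulated data, not from this thread) of fitting mgcv GAMs with poisson and gamma families, mirroring the frequency/severity setup described above:

```r
library(mgcv)

# Hypothetical insurance-style example on simulated data:
# claim counts with a poisson error distribution and a log link.
set.seed(123)
df <- data.frame(age = runif(500, 18, 80))
df$claims <- rpois(500, lambda = exp(0.5 + 0.02 * (df$age - 50)^2 / 100))

# Poisson GAM: the log link maps the linear predictor back to positive counts
fit_pois <- gam(claims ~ s(age), family = poisson(link = "log"), data = df)
summary(fit_pois)

# Severity (strictly positive claim sizes) would use a gamma family instead:
# gam(severity ~ s(age), family = Gamma(link = "log"), data = df)
```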

AlbertoAlmuinha commented 3 years ago

Hi @mdancho84 ,

I have tried creating this model to see how the s() function is handled and have not encountered any problems, at least using the "formula" parsnip interface. I attach a proof of concept where the results can be seen.

Regards,

proof_of_concept

mdancho84 commented 3 years ago

Very nice!! 👍

BenWynne-Morris commented 3 years ago

Awesome Alberto! Thanks for working on this. Note: if you replace the smooth s() with a tensor ti() for the interaction, you may get better performance.
While s() is good for variables on the same scale, e.g. geospatial (x, y) coordinates, ti() has an additional smoothing parameter, which helps when the variables involved in the interaction are on different scales. Experts like Noam Ross also recommend using method = "REML" instead of the default GCV criterion when fitting with gam().
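A sketch of this suggestion applied to the interaction model from the reprex above (assuming the `data_prepared_tbl` and `splits` objects already exist): ti() adds only the pure interaction term on top of the s() main effects (unlike te(), which includes them), and method = "REML" replaces the default GCV smoothness selection:

```r
library(mgcv)

# Main effects as s() terms plus a ti() tensor-product interaction,
# fitted with REML smoothness selection instead of the GCV default
model_fit_gam_ti <- gam(
    value ~ s(date_month, k = 12) + s(date_num) + s(lag_24) +
        ti(date_num, date_month),
    data   = training(splits),
    method = "REML"
)
model_fit_gam_ti
```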

AlbertoAlmuinha commented 3 years ago

Hi @BenWynne-Morris ,

If you want to replace the smooth s() with a tensor ti(), it is enough to modify the formula in parsnip::fit() and add to ti() the parameters you want. You can also change the fitting method through parsnip::set_engine() (along with all the other parameters of mgcv::gam()), as I show in the attached image.

Greetings,

PS: Are you working on this? To be honest, I just wrote this function out of curiosity to see how it handled the s() issue. I don't know what the modeling idea is, but if you want me to help you with this, no problem.

params

mdancho84 commented 3 years ago

This looks great. What's everyone's thoughts? I'm thinking either:

  1. A new R package called modeltime.gam that covers just what we need for regression, or
  2. Alternatively, we could be a little more ambitious and make a gamsnip R package. This would cover regression, classification, and multi-class problems, and would be useful for a broader set of business problems beyond forecasting.
AlbertoAlmuinha commented 3 years ago

Hi,

IMHO, the first option doesn't convince me, because I think it risks accumulating too many packages in the modeltime ecosystem (in that case, wouldn't it be better to include it directly in modeltime together with the rest of the algorithms?). I can understand the package-size concerns, but I think an additional package should be reserved for something more general. Too many overly specific packages can leave people unsure which package does what (but this is just my opinion).

So maybe the gamsnip option is the best. But whatever you say :)

BenWynne-Morris commented 3 years ago

No strong opinion.

mdancho84 commented 3 years ago

OK, I'm leaning towards a new R package that covers GAM models and the mgcv package. We OK with gamsnip? If so, I'll create a repo for it.

AlbertoAlmuinha commented 3 years ago

I’m OK ✅✅

Let’s go with gamsnip

mdancho84 commented 3 years ago

Alright. I will set up tomorrow.

BenWynne-Morris commented 3 years ago

Hi @AlbertoAlmuinha,

For an interactive intro to the theory, you might find this reference helpful: https://m-clark.github.io/generalized-additive-models/case_for_gam.html

I recently took the course provided by Noam Ross (https://noamross.github.io/gams-in-r-course/), which is really amazing.

I'm relatively new to GAMs and package development but will follow along and try to help out where I can, e.g. testing, etc. If you have any questions about GAMs, I'll attempt to fill in the gaps. Note that once you have a fitted gam model object, you can use summary(), plot(), and predict() just like with a fitted lm model object. The broom package functions such as tidy(), glance(), and augment() also work with fitted GAM objects.
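A minimal sketch of that workflow on simulated data, showing the base-R and broom verbs on a fitted gam object:

```r
library(mgcv)
library(broom)

# Fit a GAM to simulated non-linear data
set.seed(1)
d   <- data.frame(x = runif(200))
d$y <- sin(2 * pi * d$x) + rnorm(200, sd = 0.2)

fit <- gam(y ~ s(x), data = d, method = "REML")

summary(fit)   # same workflow as a fitted lm object
tidy(fit)      # per-smooth-term statistics (edf, F, p-value)
glance(fit)    # one-row model summary (df, AIC, deviance, ...)
augment(fit)   # original data plus .fitted and .resid columns
```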

Very basic theory: GLMs are a generalisation of linear regression, with two key differences: (1) errors don't have to be normally distributed, e.g. you could have binomial / poisson / gamma errors instead; (2) linear predictions can be transformed using another function.

Linear regression uses (1) gaussian errors and (2) an "identity" transformation function. Logistic regression is also a special case of a GLM, with (1) binomial errors and (2) a squashing function to transform predictions to a [0,1] range, so they can be interpreted as probabilities.

GAMs are a generalisation of GLMs. The s() function wrapped around a variable name means that the algorithm has the flexibility to find a non-linear relationship (e.g. a U-shape or other non-monotonic relationship) instead of a linear relationship between predictors and the target. With two variables inside s(), you're specifying a 2-d smooth interaction. It's like adding an interaction in linear regression.

The following graph is an example of how the thing we're predicting varies with a 2-d smooth, with confidence intervals added to show the uncertainty of the estimated 2-d smooth surface: (image attached)
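The ideas above can be sketched in a few lines (simulated data; the U-shaped age/risk relationship is a made-up illustration): a logistic GAM is just gam() with a binomial family, and a 2-d smooth is two variables inside one smooth term:

```r
library(mgcv)

# Logistic GAM: binomial errors + logit link, capturing a U-shaped
# relationship between age and event probability (simulated)
set.seed(42)
d <- data.frame(age = runif(1000, 18, 80))
d$claim <- rbinom(1000, 1, plogis(-2 + 0.002 * (d$age - 45)^2))

fit_logit <- gam(claim ~ s(age), family = binomial, data = d)
summary(fit_logit)

# A 2-d smooth interaction puts two variables inside one smooth term;
# vis.gam() visualises the fitted surface (sketch, not run here):
# fit_2d <- gam(y ~ s(x1, x2), data = some_data)
# vis.gam(fit_2d, view = c("x1", "x2"), plot.type = "contour")
```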

mdancho84 commented 3 years ago

Hey @AlbertoAlmuinha and @BenWynne-Morris I've moved this project to business-science/gamsnip#1

AlbertoAlmuinha commented 3 years ago

Thank you very much @BenWynne-Morris for the explanation. I found the first link really useful, and I will definitely take the DataCamp course. I was familiar with GLMs but hadn't really heard about GAMs. Your message is really valuable!

AlbertoAlmuinha commented 3 years ago

Can we close this issue too?

mdancho84 commented 3 years ago

Yes, covered in parsnip now.
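For anyone landing on this issue later: a minimal sketch of the parsnip interface that now covers this, using parsnip::gen_additive_mod() with the "mgcv" engine (assumes the `splits` object from the reprex above; engine arguments such as `method` are passed through to mgcv::gam()):

```r
library(tidymodels)
library(mgcv)

# GAM model specification via parsnip, fitted with the formula interface
# so that s() smooth terms can be used directly
gam_spec <- gen_additive_mod() %>%
    set_engine("mgcv", method = "REML") %>%
    set_mode("regression")

gam_fit <- gam_spec %>%
    fit(value ~ s(date_month, k = 12) + s(date_num) + s(lag_24),
        data = training(splits))
```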