dswah / pyGAM

[HELP REQUESTED] Generalized Additive Models in Python
https://pygam.readthedocs.io
Apache License 2.0

Interactions between features in the GAM #194

Closed marknormanread closed 6 years ago

marknormanread commented 6 years ago

Nowhere obvious to leave this, so here we are. You asked for feedback. I've been using pyGAM for a while now, on and off, as it fits better into my workflow than R does.

One thing that isn't clear to me is how you instruct the GAM to consider interactions between features. In R's "mgcv" package you build a GAM something like this:

mod <- gam(response ~ s(A, k=k1, bs="tp")
           + s(B, k=k1, bs="tp")
           + s(A,B, k=k2, bs="tp"),
           data=training)

A and B must be columns in the training dataframe. The s(A,B, k=k2, bs="tp") term indicates that you want to look not just at A and B in isolation, but also at interactions between them. I'm no expert in GAMs, so I can't say how that interaction is actually handled.

summary(mod) describes the strength of the interaction. See, for instance, the "s(A,B) 11.757 12.000 563.092 <2e-16 ***" line in this output:

Family: Beta regression(10.739) 
Link function: logit 

Formula:
eval(parse(text = responseName)) ~ s(A, k = k1, bs = "tp") + 
    s(B, k = k1, bs = "tp") + s(A, B, k = k2, bs = "tp")
<environment: 0x7fe9ac5898b0>

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.88340    0.04347   66.33   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
          edf Ref.df  Chi.sq p-value    
s(A)    6.585  8.109 810.041  <2e-16 ***
s(B)    1.006  1.010   4.917   0.027 *  
s(A,B) 11.757 12.000 563.092  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.934   Deviance explained = 98.1%
-REML = -2685.5  Scale est. = 1         n = 400

My impression is that simply supplying the features to pyGAM does not make it model interactions between them. I am doing this to build heatmaps of a response to two features (hence 3D data, rather than the 2D response-feature line graphs you have in your examples). My heatmaps looked "blocky", with the contours being almost entirely vertical and horizontal. When I added extra features, multiplying the first two features together or subtracting one from the other (hence crudely defining interactions), my heatmaps looked much more fluid.
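To make the point about missing interactions concrete, here is a minimal sketch in plain Python. The functions `additive` and `with_product` and the helper `interaction_contrast` are hypothetical illustrations, not part of pyGAM or mgcv. It shows that any surface of the form f(A) + g(B) has a zero "mixed difference" over a rectangle of inputs, while adding a product term A*B (the crude workaround described above) does not.

```python
import math


def additive(a, b):
    # A purely additive surface: one function of A plus one function of B.
    return math.sin(a) + b ** 2


def with_product(a, b):
    # The same surface plus a hand-made interaction feature A * B.
    return additive(a, b) + 0.5 * a * b


def interaction_contrast(f, a0, a1, b0, b1):
    # Mixed difference over the rectangle [a0, a1] x [b0, b1].
    # For any f(a, b) = g(a) + h(b) the four terms cancel.
    return f(a1, b1) - f(a1, b0) - f(a0, b1) + f(a0, b0)


c_add = interaction_contrast(additive, 0.2, 1.5, -1.0, 2.0)
c_int = interaction_contrast(with_product, 0.2, 1.5, -1.0, 2.0)

print("additive surface:", c_add)
print("with product term:", c_int)
```

The first contrast vanishes up to floating-point rounding, whichever rectangle is chosen. That is why a model built only from per-feature smooths cannot reproduce a response in which the effect of A depends on B.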

So, long story short, can you instruct pyGAM to look for interactions in the same way that R does, rather than the naive method I attempted (adding extra features)? If this is already implemented, could you add an example to show other users?

Many thanks for your efforts on this project.

dswah commented 6 years ago

@marknormanread thanks very much for leaving feedback. it is very valuable to me to hear from users.

currently, there is no way to easily build feature interactions in pyGAM.

but that is about to change with https://github.com/dswah/pyGAM/pull/169.

there are a few details to fill in, but the core functionality is there.

i really hope to finish this in the next 2 weeks.

i will let you know when it's ready.

dswah commented 6 years ago

@marknormanread this is ready! you can now specify interactions between features using tensor products.

for example, if you want an interaction between features 0 and 1, and a spline on feature 2, you could do:

from pygam import LinearGAM, s, te
LinearGAM(te(0,1) + s(2)).fit(X, y)