mgcv vs pygam results - Githubissues

ilkot commented 3 years ago

It is more than a question than an issue. I'm trying to get same results with mgcv's and pygam's gam models but results are quite different.

here is the data with 500 observations https://file.io/0gSYwb0hinBI

R:

example_r = read_csv("path/to/data/example_gam.csv")

gam_ex <- gam(y ~ x1+ s(x2,x3,x4,x5,x6),
              family = gaussian,
              method = "GCV.Cp",
              data=example_r)

x1_s = 0
x2_s = -0.46232985
x3_s = 1.5689428
x4_s = -3.852407
x5_s = 8.378359
x6_s = 92.91196

sample <- data.frame(c(x1_s),c(x2_s),c(x3_s),c(x4_s),c(x5_s),c(x6_s))
names(sample) <- c('x1','x2','x3','x4','x5','x6')
predict(gam_ex, sample)

Output: -0.005903112

Python:

example_py = pd.read_csv("path/to/data/example_gam.csv")
X = example_py.drop(['y'],axis=1)
y = example_py['y']

gam = LinearGAM()
gam.fit(X,y)

x1_s = 0
x2_s = -0.46232985
x3_s = 1.5689428
x4_s = -3.852407
x5_s = 8.378359
x6_s = 92.91196

sample = pd.DataFrame(data = [{'x1':x1_s,'x2':x2_s,'x3':x3_s,'x4':x4_s,'x5':x5_s,'x6':x6_s}])
gam.predict(sample)

Output: 0.08549611

I'm suspicious about the formula in R because it is written as x1 + s(x2,...) but in pygam do we have a chance to write like that?

Any comment would be helpful! Thanks in advance!

dswah commented 3 years ago

Hi @ilkot Indeed, i also suspect that differences in the model specification could be the main cause of the discrepancies.

I am not very familiar with the MGCV syntax but i will try to answer.

I think the differences could come from:

model specification
- your mgcv model x1+ s(x2,x3,x4,x5,x6) appears to have a linear term added to a nonlinear function (interaction?) with several terms
- in contrast your pyGAM model is fitting a smoothing function to each dimension of your X data.
- can you elaborate on what the s(x2, x3, ...) is doing?
- if it is indeed an interaction term, you can achieve this with:

from pygam import l, te 

LinearGAM(terms=l(0) + te(1, 2, 3, 4, 5))

mgcv appears to be doing a model-selection/smoothing-selection step (method="GCV.cp")
- but your pyGAM fit is only fitting a single model with the default smoothing. You could try instead to search for a more optimal smoothing parameter:

import numpy as np

# set up a search-space
lam = np.logspace(-3, 5, 50)
lams = [lam] * 2 # here you are specifyng a l2 reg. for your linear term, and a shared smoothing for all dims. of your tensor term

gam.gridsearch(X, y, lam=lams)
gam.summary()

mgcv might use a different default number of smoothing splines.
- pyGAM uses 20 splines for each nonlinear dimension by default

Let me know if this helps!

ilkot commented 3 years ago

Thanks for the detailed answer, appreciate it!

s(x2,x3,..) is the geodimensional s function as shown below for 2 variables and yes it is an interaction term

I tried to fit as you suggest but it throws memory error which is quite interesting because dataset only contains only 500 rows and 7 columns in total

pygam\terms.py", line 1318, in build_penalties
    P = sp.sparse.csc_matrix(np.zeros((self.n_coefs, self.n_coefs)))

MemoryError: Unable to allocate 74.5 GiB for an array with shape (100000, 100000) and data type float64

I restart the kernel several times but it didn't change.

pygam: 0.8.0 python. 3.7.6

ilkot commented 3 years ago

@dswah do you know how can I avoid this error?

MRanka29 commented 4 months ago

Hi @ilkot ,

Were you able to resolve the issue of pygam's gam vs mgcv's gam?

ilkot commented 4 months ago

@MRanka29 nope unfortunately.

dswah / pyGAM

mgcv vs pygam results #294