dswah / pyGAM

[HELP REQUESTED] Generalized Additive Models in Python
https://pygam.readthedocs.io
Apache License 2.0
857 stars 157 forks

Problem with partial dependence and categories #301

Open jonathan-taylor opened 2 years ago

jonathan-taylor commented 2 years ago

It seems categorical variables must contain 0 as one of the values. This is apparent in partial_dependence:

import numpy as np
from pygam import LinearGAM, s, f

X = np.random.standard_normal((100, 3))
X[:,2] = np.random.choice([0,1], 100, replace=True)
Y = np.random.standard_normal(100)

G = LinearGAM(s(0) + s(1) + f(2)).fit(X, Y)
G.partial_dependence(0)

This works fine, but:

X2 = X.copy()
X2[:,2] += 2
G2 = LinearGAM(s(0) + s(1) + f(2)).fit(X2, Y)
G2.partial_dependence(0)

raises the following:

ValueError: X data is out of domain for categorical feature 2. Expected data on [2.0, 3.0], but found data on [0.0, 0.0]

The issue is that check_X validates the categorical column of the generated _modelmat, which has 0s everywhere except the term's own column. Really, _modelmat just needs valid values: partial dependence only requires that the other columns be held constant, not that they be 0. I'd also recommend centering the partial dependence values, since it is their shape that is of interest rather than their level...
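To make the "valid values, not necessarily 0" point concrete, here is a minimal NumPy-only sketch of a grid builder that holds every other column at an observed value. The helper name constant_fill_grid is hypothetical and is not part of pyGAM; holding the other columns at the first row's values is just one arbitrary choice of valid constant.

```python
import numpy as np

def constant_fill_grid(X, term, n=100):
    """Grid that varies column `term` and holds all other columns constant.

    Hypothetical helper (not pyGAM API): the constants are taken from the
    first observed row, so a categorical column gets a level seen in training.
    """
    X = np.asarray(X, dtype=float)
    grid = np.tile(X[0], (n, 1))  # every column starts at an observed value
    grid[:, term] = np.linspace(X[:, term].min(), X[:, term].max(), n)
    return grid

rng = np.random.default_rng(0)
X2 = rng.standard_normal((100, 3))
X2[:, 2] = rng.choice([2, 3], 100)  # categorical levels that exclude 0

grid = constant_fill_grid(X2, term=0)
# Column 2 of the grid now holds a fitted level (2 or 3) instead of 0.
```

partial_dependence accepts an X argument, so a grid like this could presumably be passed as G2.partial_dependence(term=0, X=grid) as a workaround; I have not verified that against this exact failure.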

jonathan-taylor commented 2 years ago

See https://github.com/dswah/pyGAM/pull/302

jonathan-taylor commented 2 years ago

An issue with this fix is that the standard errors (the error bars) will depend on where we evaluate. It might be better to return \hat{\mu}(X_grid) - \hat{\mu}(\bar{X}), i.e. to evaluate along a line through \bar{X}.
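A rough sketch of the centered quantity \hat{\mu}(X_grid) - \hat{\mu}(\bar{X}), using a toy additive function in place of a fitted model. Both centered_curve and toy_predict are hypothetical names for illustration only, and the sketch does not compute standard errors.

```python
import numpy as np

def centered_curve(predict, X, term, n=50):
    """Evaluate predict along a line through the column means, minus predict at the mean."""
    X = np.asarray(X, dtype=float)
    x_bar = X.mean(axis=0)
    grid = np.tile(x_bar, (n, 1))  # all columns fixed at their means
    grid[:, term] = np.linspace(X[:, term].min(), X[:, term].max(), n)
    return grid[:, term], predict(grid) - predict(x_bar[None, :])

def toy_predict(X):
    # Stand-in for a fitted additive model: one function per column.
    return np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 2.0 * X[:, 2]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
xs, curve = centered_curve(toy_predict, X, term=0)
# For an additive model the other columns cancel, leaving only the
# term's own contribution relative to its value at the mean.
```

One caveat: for a categorical column the mean is generally not one of the fitted levels, so a real model with a domain check would need a valid level there instead (for example the most frequent one).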

5ch0r5ch1 commented 6 months ago

@jonathan-taylor did you notice that G2.partial_dependence(2) actually works, and only G2.partial_dependence(0) and G2.partial_dependence(1) fail? That means the issue caused by X2[:,2] affects the calls for X2[:,0] and X2[:,1]. How can we explain this?

nickeubank commented 5 months ago

Yeah, I see. I'm getting this too. I think the problem arises because, when partial_dependence is computed for any other feature, the categorical column gets filled in with zeros (which I think pyGAM assumes is the omitted category), and those zeros cannot be evaluated when 0 is not one of the fitted categories.
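The asymmetry between term 2 and terms 0 and 1 can be reproduced with a small NumPy-only sketch. This mimics the behaviour described in this thread (zero-filled grid plus a range check on the categorical column); it is not pyGAM's actual code, and zero_filled_grid and in_domain are hypothetical names.

```python
import numpy as np

def zero_filled_grid(X, term, n=5):
    """Grid that varies column `term` and fills every other column with 0."""
    grid = np.zeros((n, X.shape[1]))
    grid[:, term] = np.linspace(X[:, term].min(), X[:, term].max(), n)
    return grid

def in_domain(grid, X, cat_col):
    """True if the grid's categorical column lies within the range seen in X."""
    lo, hi = X[:, cat_col].min(), X[:, cat_col].max()
    return bool(np.all((grid[:, cat_col] >= lo) & (grid[:, cat_col] <= hi)))

rng = np.random.default_rng(0)
X2 = rng.standard_normal((100, 3))
X2[:, 2] = rng.choice([2, 3], 100)  # categorical levels that exclude 0

results = {term: in_domain(zero_filled_grid(X2, term), X2, cat_col=2)
           for term in range(3)}
# results == {0: False, 1: False, 2: True}
```

When the grid is built for term 2, the categorical column itself spans its fitted range, so the check passes. For terms 0 and 1 that column is all zeros, which falls outside [2, 3].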