dswah / pyGAM

[HELP REQUESTED] Generalized Additive Models in Python
https://pygam.readthedocs.io
Apache License 2.0

Categorical Features with no 0 label lead to partial_dependence ValueError #285

Open tatkeller opened 3 years ago

tatkeller commented 3 years ago

Hi there,

I created a model whose categorical features only take values greater than 0 (they range from n to m, where n > 0 and m > 0). When I tried to plot the partial dependence for the model, I ran into a ValueError (reproduced below). The problem is that generate_X_grid creates a matrix that looks like this, with zeros in every column except column i:

[[0,0,0, ..., 0, i, 0, ..., 0,0,0],
[0,0,0, ..., 0, i, 0, ..., 0,0,0],
...,
[0,0,0, ..., 0, i, 0, ..., 0,0,0]]

For a model trained on categorical features that do not have 0 as a category, passing this grid to partial_dependence raises an error, because the zero-filled columns fall outside the range of categories seen during training.
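
To make the mechanism concrete, here is a minimal sketch that uses only NumPy. It is not pyGAM's code: `zero_filled_grid` and `check_categorical_domain` are hypothetical stand-ins that imitate the grid shape shown above and a simple range check, and the year and age values are toy numbers chosen for illustration.

```python
import numpy as np


def zero_filled_grid(n_features, feature, values):
    """Hypothetical stand-in: a grid that varies one column and zero-fills the rest."""
    grid = np.zeros((len(values), n_features))
    grid[:, feature] = values
    return grid


def check_categorical_domain(X, feature, lo, hi):
    """Hypothetical stand-in: reject data outside the categories seen in training."""
    col = X[:, feature]
    if col.min() < lo or col.max() > hi:
        raise ValueError(
            "X data is out of domain for categorical feature {}. "
            "Expected data on [{}, {}], but found data on [{}, {}]".format(
                feature, lo, hi, col.min(), col.max()
            )
        )


# Grid for feature 1 (toy ages); feature 0 is left at zero.
grid = zero_filled_grid(n_features=3, feature=1, values=np.linspace(18, 80, 5))

message = None
try:
    # Assume feature 0 was trained on categories 2003..2009, so 0 is not valid.
    check_categorical_domain(grid, feature=0, lo=2003, hi=2009)
except ValueError as err:
    message = str(err)
    print(message)
```

The check fails on feature 0 even though the grid was built for feature 1, which is why the zero fill matters for any categorical column whose categories do not include 0.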

Here is a reproduction of the error, based on the Quick Start example code:

Input:

from pygam.datasets import wage

X, y = wage()

from pygam import LinearGAM, s, f

# Use f(0) to make the 0th term categorical. Feature 0 contains no value equal to 0.
gam = LinearGAM(f(0) + s(1) + f(2)).fit(X, y)

import matplotlib.pyplot as plt

for i, term in enumerate(gam.terms):
    if term.isintercept:
        continue

    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)

    #plt.figure()
    plt.plot(XX[:, term.feature], pdep)
    plt.plot(XX[:, term.feature], confi, c='r', ls='--')
    plt.title(repr(term))
    plt.show()

Output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-0e5df89ff530> in <module>()
      7     XX = gam.generate_X_grid(term=i)
      8     print(XX)
----> 9     pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)
     10 
     11     #plt.figure()

/Users/tatekeller/opt/anaconda3/envs/pbh/lib/python3.6/site-packages/pygam/pygam.py in partial_dependence(self, term, X, width, quantiles, meshgrid)
   1542                         features=self.feature, verbose=self.verbose)
   1543 
-> 1544         modelmat = self._modelmat(X, term=term)
   1545         pdep = self._linear_predictor(modelmat=modelmat, term=term)
   1546         out = [pdep]

/Users/tatekeller/opt/anaconda3/envs/pbh/lib/python3.6/site-packages/pygam/pygam.py in _modelmat(self, X, term)
    455         X = check_X(X, n_feats=self.statistics_['m_features'],
    456                     edge_knots=self.edge_knots_, dtypes=self.dtype,
--> 457                     features=self.feature, verbose=self.verbose)
    458 
    459         return self.terms.build_columns(X, term=term)

/Users/tatekeller/opt/anaconda3/envs/pbh/lib/python3.6/site-packages/pygam/utils.py in check_X(X, n_feats, min_samples, edge_knots, dtypes, features, verbose)
    301                                      'feature {}. Expected data on [{}, {}], '\
    302                                      'but found data on [{}, {}]'\
--> 303                                      .format(i, min_, max_, x.min(), x.max()))
    304 
    305     return X

ValueError: X data is out of domain for categorical feature 0. Expected data on [2003.0, 2009.0], but found data on [0.0, 0.0]

The versions I used are pyGAM 0.8.0 and Python 3.6.12.

For now I will work around this by subtracting the minimum value from each categorical feature, which shifts the category range from (n, m) to (n-n, m-n) == (0, m-n).
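
A minimal sketch of that workaround, using only NumPy, might look like the following. The helper name `shift_categorical_to_zero` and the toy array are illustrative assumptions, not part of pyGAM. The sketch only performs the shift described above, and I have not verified it against every pyGAM version.

```python
import numpy as np


def shift_categorical_to_zero(X, categorical_columns):
    """Return a float copy of X with each listed column shifted so its minimum is 0.

    Also returns the per-column offsets so the original labels can be restored.
    """
    X_shifted = np.array(X, dtype=float, copy=True)
    offsets = {}
    for col in categorical_columns:
        offsets[col] = X_shifted[:, col].min()
        X_shifted[:, col] -= offsets[col]
    return X_shifted, offsets


# Toy data: column 0 is a year-like category, column 1 is continuous,
# column 2 is another category that does not start at 0.
X_toy = np.array([[2003, 25, 1],
                  [2006, 40, 2],
                  [2009, 31, 4]])

X_shifted, offsets = shift_categorical_to_zero(X_toy, categorical_columns=[0, 2])
print(X_shifted[:, 0])  # [0. 3. 6.]
```

The model would then be fit on `X_shifted` instead of the original array. Adding `offsets[col]` back to the grid values when labelling plot axes restores the original category labels.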

Thanks in advance

5ch0r5ch1 commented 9 months ago

See https://github.com/dswah/pyGAM/pull/302