SauceCat / PDPbox

python partial dependence plot toolbox
http://pdpbox.readthedocs.io/en/latest/
MIT License

Rounding in `get_grid_` causes issues for features with narrow scale #13

Closed: flinder closed this issue 6 years ago

flinder commented 6 years ago

If a feature has percentiles that vary by less than 0.01, the generated grid contains duplicate values, which (for some reason) leads to unequal dimensions of the `.feature_grids` and `.pdp` attributes of the `pdpbox.pdp.pdp_isolate_obj`. With this development version, the dimension problem is gone, but the grid has only one value in this example. I could fix the problem for my purposes by just removing the rounding statements in the `_get_grids()` function (see my fork), but I assume there's a reason for the rounding, so a real fix is probably more involved (?).

Here's a reproducible example (using the version from pypi):

```python
import pandas as pd
import numpy as np

from pdpbox import pdp  # version from PyPI
from sklearn.linear_model import SGDClassifier

np.random.seed(123)
# Both features span only 1e-4, well below the 0.01 threshold described above
df = pd.DataFrame({'y': np.random.randint(0, 2, 100),
                   'x1': np.random.uniform(0.5, 0.5001, 100),
                   'x2': np.random.uniform(0.5, 0.5001, 100)})

clf = SGDClassifier()
X = df[['x1', 'x2']]
y = df['y']
clf.fit(X, y)
P = pdp.pdp_isolate(model=clf, train_X=X, feature='x1')

# On the PyPI version these two attributes have unequal dimensions
print(P.feature_grids)
print(P.pdp)
```

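
The grid collapse itself can probably be shown without PDPbox. The following is a minimal sketch, assuming the grid is built from feature percentiles and then rounded to two decimals; the variable names and the choice of 10 grid points are illustrative, and this is not PDPbox's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(123)
# A feature with a very narrow scale, as in the example above
x = rng.uniform(0.5, 0.5001, 100)

# Assumed grid construction: 10 evenly spaced percentiles of the feature
percentiles = np.linspace(0, 100, 10)
raw_grid = np.percentile(x, percentiles)

# Assumed rounding step: two decimals
rounded_grid = np.round(raw_grid, 2)

unique_raw = np.unique(raw_grid)
unique_rounded = np.unique(rounded_grid)

print(len(unique_raw))   # number of distinct grid points before rounding
print(unique_rounded)    # -> [0.5]
```

Under these assumptions every grid point rounds to the same value, so deduplicating the grid leaves a single point, while the unrounded grid keeps all of its distinct points.
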
SauceCat commented 6 years ago

Interesting catch.