Closed jrichardhu closed 6 years ago
@jrichardhu thank you so much for the advice, I will look at it. :smile_cat:
@jrichardhu I am facing the same "ValueError: cannot reindex from a duplicate axis" issue when using pdpbox.pdp.pdp_plot, even though I tried to deduplicate train_X with df.drop_duplicates(). Have you managed to implement a working solution for this problem? Thanks for your help!
This is how I plotted one variable (pdp_var). Try to see if something similar to this works and if not, perhaps try lowering the value of num_grid_points.
from pdpbox import pdp

# clf is the fitted model, X_Train_all the training dataframe,
# pdp_var the name of the feature to plot.
dedup_df = X_Train_all.drop_duplicates(subset=pdp_var)
pdp_ur = pdp.pdp_isolate(clf, dedup_df, pdp_var,
                         num_grid_points=15,
                         percentile_range=(5, 95))
pdp.pdp_plot(pdp_ur, pdp_var,
             center=True, plot_org_pts=True,
             plot_lines=True, frac_to_plot=0.5,
             figsize=(10, 10))
Thank you @jrichardhu, your example code and the advice on lowering num_grid_points made my code work perfectly!
I recently started using this excellent repository to fill a much-needed gap in scikit-learn. For clarity, I suggest that the documentation of the pdpbox.pdp.pdp_isolate parameters require train_X to be a deduplicated pandas dataframe. I was confused when I couldn't plot because of indexing issues caused by duplicated values. The fix is as simple as df.drop_duplicates(). Thanks for all of your work!
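For anyone hitting the same error, here is a minimal sketch of the cleanup step described above. The helper name dedupe_for_pdp and the toy columns are made up for illustration and are not part of pdpbox. Resetting the index is an extra precaution on my part, on the assumption that repeated index labels can also trigger the reindexing error.

```python
import pandas as pd


def dedupe_for_pdp(train_X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not part of pdpbox): drop duplicated rows
    and rebuild a unique 0..n-1 index before computing a PDP."""
    deduped = train_X.drop_duplicates()
    # Assumption: duplicate index labels may also cause the
    # "cannot reindex from a duplicate axis" error, so reset them too.
    return deduped.reset_index(drop=True)


# Toy data: the first two rows are identical and share an index label.
train_X = pd.DataFrame(
    {"age": [30, 30, 45], "income": [50, 50, 80]},
    index=[0, 0, 1],
)
clean_X = dedupe_for_pdp(train_X)
print(clean_X.shape)            # (2, 2)
print(clean_X.index.is_unique)  # True
```

The cleaned frame can then be passed as train_X in place of the original.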
EDIT:
Another data-checking step should be added at line 303 of pdp.py for pdp.pdp_interact. If the feature grids are not specified, num_grid_points defaults to 10, and if train_X.shape[0] is less than 100, line 305 raises an error because data_chunk_size rounds down to 0. I had to specify num_grid_points=[5, 5] for it to run with train_X.shape[0] = 25.
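To illustrate the rounding problem, here is a rough sketch of the arithmetic. I am assuming that data_chunk_size is the row count divided by the number of grid combinations and truncated to an integer. That assumption matches the numbers above, but it is not copied from pdp.py. The function names are hypothetical.

```python
def assumed_chunk_size(n_rows, num_grid_points):
    """Assumed formula (not taken from pdp.py): rows divided by the
    number of grid-point combinations, truncated to an integer."""
    n_combos = num_grid_points[0] * num_grid_points[1]
    return int(n_rows / n_combos)


def check_grid_points(n_rows, num_grid_points):
    """Hypothetical guard: fail early with a readable message when the
    chunk size would round down to 0."""
    if assumed_chunk_size(n_rows, num_grid_points) == 0:
        raise ValueError(
            "train_X has %d rows but %d x %d grid points were requested; "
            "lower num_grid_points" % (n_rows, num_grid_points[0], num_grid_points[1])
        )


print(assumed_chunk_size(25, [10, 10]))  # 0, which would trigger the error
print(assumed_chunk_size(25, [5, 5]))    # 1, which runs
check_grid_points(25, [5, 5])            # passes silently
```

A check along these lines before the chunking step would turn the failure into a clear message about num_grid_points.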