Not exactly an issue: dedup DataFrame

jrichardhu commented 6 years ago

I recently started using this excellent repository, which fills a much-needed gap in scikit-learn. A suggestion for clarity in the parameters of pdpbox.pdp.pdp_isolate: require train_X to be a deduplicated pandas DataFrame. It caused a bit of confusion on my part when I couldn't plot because of indexing issues from duplicated values. The fix is as simple as df.drop_duplicates(). Thanks for all of your work!

EDIT:

Another data check should be added at line 303 of pdp.py for pdp.pdp_interact. If the feature grids are not specified and default to 10 points each, and train_X.shape[0] is less than 100, you get an error on line 305 because data_chunk_size rounds to 0. I had to specify num_grid_points=[5, 5] to get it to run with train_X.shape[0] = 25.
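To make the arithmetic concrete, here is a minimal sketch of the kind of check I mean. It assumes the chunk size is the number of rows floor-divided by the number of grid combinations, which matches the behaviour described above but is not copied from pdp.py; chunk_size and check_grid are hypothetical helper names, not part of PDPbox.

```python
def chunk_size(n_rows, num_grid_points):
    """Hypothetical stand-in for the data_chunk_size computation.

    Assumption: rows are split by the number of grid combinations
    using integer (floor) division.
    """
    n_combinations = num_grid_points[0] * num_grid_points[1]
    return n_rows // n_combinations


def check_grid(n_rows, num_grid_points=(10, 10)):
    """Fail early with a readable message instead of a zero chunk size."""
    size = chunk_size(n_rows, num_grid_points)
    if size < 1:
        raise ValueError(
            "train_X has %d rows but the grid has %d combinations; "
            "lower num_grid_points"
            % (n_rows, num_grid_points[0] * num_grid_points[1])
        )
    return size


print(chunk_size(25, (10, 10)))  # 0: 25 rows, 100 grid combinations
print(chunk_size(25, (5, 5)))    # 1: 25 rows, 25 grid combinations
```

Under that assumption, calling check_grid(25) would raise right away, while check_grid(25, (5, 5)) passes.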

SauceCat commented 6 years ago

@jrichardhu thank you so much for the advice, I will look at it. :smile_cat:

PierreMegret commented 6 years ago

@jrichardhu I am facing the same "ValueError: cannot reindex from a duplicate axis" issue when using pdpbox.pdp.pdp_plot, even though I deduplicated train_X with df.drop_duplicates(). Have you managed to implement a working solution for this problem? Thanks for your help!

jrichardhu commented 6 years ago

This is how I plotted one variable (pdp_var). See whether something similar works for you; if not, try lowering the value of num_grid_points.

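from pdpbox import pdp

# clf is the fitted model, X_Train_all the training DataFrame,
# and pdp_var the name of the feature column to plot.
# Keep one row per distinct value of the plotted feature.
dedup_df = X_Train_all.drop_duplicates(subset=pdp_var)
pdp_ur = pdp.pdp_isolate(clf, dedup_df, pdp_var,
                         num_grid_points=15,
                         percentile_range=(5, 95))
pdp.pdp_plot(pdp_ur, pdp_var,
             center=True, plot_org_pts=True,
             plot_lines=True, frac_to_plot=0.5,
             figsize=(10, 10))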
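Note that drop_duplicates(subset=pdp_var) is not the same as a plain drop_duplicates(): the plain call only drops rows that are identical in every column, so the plotted feature can still contain repeated values. A small pandas-only sketch of the difference, with a toy frame and made-up column names standing in for the real training data:

```python
import pandas as pd

# Toy stand-in for the training frame; column names are illustrative only.
X = pd.DataFrame({
    "age":    [30, 30, 41, 41, 52],
    "income": [10, 20, 30, 30, 50],
})

# Drops only rows that match in every column.
full_row = X.drop_duplicates()

# Keeps the first row for each distinct value of the chosen feature.
by_feature = X.drop_duplicates(subset="age")

print(len(full_row), full_row["age"].is_unique)      # 4 False
print(len(by_feature), by_feature["age"].is_unique)  # 3 True
```

Keep in mind that deduplicating on one column throws rows away, so the plot is based on fewer samples than the full training set.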

PierreMegret commented 6 years ago

Thank you @jrichardhu, your example code and your advice on lowering num_grid_points made my code work perfectly!

SauceCat / PDPbox

Not exactly an issue: dedup DataFrame #4