SauceCat / PDPbox

python partial dependence plot toolbox
http://pdpbox.readthedocs.io/en/latest/
MIT License

How to train titanic_model #52

Open szz01 opened 5 years ago

szz01 commented 5 years ago

When I run this with my own data set, I get the following error:

    AttributeError                            Traceback (most recent call last)
    in
          4     feature='sex',
          5     feature_name='Gender',
    ----> 6     predict_kwds={}
          7 )

    /opt/anaconda2/envs/python35/lib/python3.5/site-packages/pdpbox/info_plots.py in actual_plot(model, X, feature, feature_name, num_grid_points, grid_type, percentile_range, grid_range, cust_grid_points, show_percentile, show_outliers, endpoint, which_classes, predict_kwds, ncols, figsize, plot_params)
        289     # make predictions
        290     # info_df only contains feature value and actual predictions
    --> 291     prediction = predict(X, **predict_kwds)
        292     info_df = X[_make_list(feature)]
        293     actual_prediction_columns = ['actual_prediction']

    /opt/anaconda2/envs/python35/lib/python3.5/site-packages/xgboost/core.py in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs, pred_interactions, validate_features)
       1282
       1283         if validate_features:
    -> 1284             self._validate_features(data)
       1285
       1286         length = c_bst_ulong()

    /opt/anaconda2/envs/python35/lib/python3.5/site-packages/xgboost/core.py in _validate_features(self, data)
       1669         """
       1670         if self.feature_names is None:
    -> 1671             self.feature_names = data.feature_names
       1672             self.feature_types = data.feature_types
       1673         else:

    /opt/anaconda2/envs/python35/lib/python3.5/site-packages/pandas/core/generic.py in __getattr__(self, name)
       5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):
       5066                 return self[name]
    -> 5067             return object.__getattribute__(self, name)
       5068
       5069     def __setattr__(self, name, value):

    AttributeError: 'DataFrame' object has no attribute 'feature_names'

So I want to know how to train the titanic_model in the example. Thanks for your advice.

dyerrington commented 5 years ago

Looks like you're referencing an attribute that doesn't exist in your dataframe @szz01. Why don't you post your full code example?

ivan-marroquin commented 4 years ago

Hi @dyerrington

I have the same issue with PDPbox version 0.2.0. I am using Python 3.6.5 on a Windows machine.

The classifier was created with xgboost 0.90 using XGBClassifier, and I fit it on Python arrays (the same data set is part of the attached zip file).

The attached zip file contains a Python script and the input data needed to reproduce the issue.

Many thanks, Ivan

testing_pdpbox.zip

ivan-marroquin commented 4 years ago

Hi there,

I was wondering if someone had the opportunity to look into this issue.

Many thanks,

Ivan

SauceCat commented 4 years ago

@ivan-marroquin can you put your error messages here?

ivan-marroquin commented 4 years ago

Hi @SauceCat

As per your request:

    pdpbox_interaction= pdp.pdp_interact(model= best_trained_model, dataset= pd_test_inputs, model_features= feature_names, features= features_to_plot)

      File "c:\temp\python\python3.6.5\lib\site-packages\pdpbox\pdp.py", line 558, in pdp_interact
        n_jobs=n_jobs, predict_kwds=predict_kwds, data_transformer=data_transformer)

      File "c:\temp\python\python3.6.5\lib\site-packages\pdpbox\pdp.py", line 159, in pdp_isolate
        for feature_grid in feature_grids)

      File "c:\temp\python\python3.6.5\lib\site-packages\joblib\parallel.py", line 921, in __call__
        if self.dispatch_one_batch(iterator):

      File "c:\temp\python\python3.6.5\lib\site-packages\joblib\parallel.py", line 759, in dispatch_one_batch
        self._dispatch(tasks)

      File "c:\temp\python\python3.6.5\lib\site-packages\joblib\parallel.py", line 716, in _dispatch
        job = self._backend.apply_async(batch, callback=cb)

      File "c:\temp\python\python3.6.5\lib\site-packages\joblib\_parallel_backends.py", line 182, in apply_async
        result = ImmediateResult(func)

      File "c:\temp\python\python3.6.5\lib\site-packages\joblib\_parallel_backends.py", line 549, in __init__
        self.results = batch()

      File "c:\temp\python\python3.6.5\lib\site-packages\joblib\parallel.py", line 225, in __call__
        for func, args, kwargs in self.items]

      File "c:\temp\python\python3.6.5\lib\site-packages\joblib\parallel.py", line 225, in <listcomp>
        for func, args, kwargs in self.items]

      File "c:\temp\python\python3.6.5\lib\site-packages\pdpbox\pdp_calc_utils.py", line 44, in _calc_ice_lines
        preds = predict(_data[model_features], **predict_kwds)

      File "c:\temp\python\python3.6.5\lib\site-packages\xgboost\core.py", line 1284, in predict
        self._validate_features(data)

      File "c:\temp\python\python3.6.5\lib\site-packages\xgboost\core.py", line 1675, in _validate_features
        if self.feature_names != data.feature_names:

      File "c:\temp\python\python3.6.5\lib\site-packages\pandas\core\generic.py", line 5180, in __getattr__
        return object.__getattribute__(self, name)

    AttributeError: 'DataFrame' object has no attribute 'feature_names'

<Figure size 3600x2400 with 0 Axes>

Many thanks, Ivan

dyerrington commented 4 years ago

To me, @ivan-marroquin, the error is descriptive:

      File "c:\temp\python\python3.6.5\lib\site-packages\xgboost\core.py", line 1675, in _validate_features
        if self.feature_names != data.feature_names:

      File "c:\temp\python\python3.6.5\lib\site-packages\pandas\core\generic.py", line 5180, in __getattr__
        return object.__getattribute__(self, name)

    AttributeError: 'DataFrame' object has no attribute 'feature_names'

The part of the code from xgboost that throws this error is this:

Line ~1675 of xgboost/core.py

    def _validate_features(self, data):
        """
        Validate Booster and data's feature_names are identical.
        Set feature_names and feature_types from DMatrix
        """
        if self.feature_names is None:
            self.feature_names = data.feature_names
            self.feature_types = data.feature_types
        else:
            # Booster can't accept data with different feature names
            if self.feature_names != data.feature_names:
                dat_missing = set(self.feature_names) - set(data.feature_names)
                my_missing = set(data.feature_names) - set(self.feature_names)

                msg = 'feature_names mismatch: {0} {1}'

                if dat_missing:
                    msg += ('\nexpected ' + ', '.join(str(s) for s in dat_missing) +
                            ' in input data')

                if my_missing:
                    msg += ('\ntraining data did not have the following fields: ' +
                            ', '.join(str(s) for s in my_missing))

                raise ValueError(msg.format(self.feature_names,
                                            data.feature_names))

As far as I can tell, xgboost is checking that the feature names the model was trained with match those of the data passed to predict. The object it receives (data in this case) is a pandas DataFrame, which has no .feature_names attribute, so the attribute lookup fails and the DataFrame raises the final AttributeError.
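To make that failure mode concrete, here is a minimal sketch in plain Python. It does not use xgboost or pandas: `FrameLike` and `DMatrixLike` are hypothetical stand-ins, `validate_features` is a simplified copy of the comparison quoted above, and the feature names are made up for illustration.

```python
class FrameLike:
    """Hypothetical stand-in for a DataFrame: has columns, but no feature_names."""

    def __init__(self, columns):
        self.columns = list(columns)


class DMatrixLike:
    """Hypothetical stand-in for the kind of object the check expects."""

    def __init__(self, feature_names):
        self.feature_names = list(feature_names)


def validate_features(model_feature_names, data):
    """Simplified version of the quoted check (not the real xgboost code)."""
    # Reading data.feature_names raises AttributeError if the attribute is absent.
    if model_feature_names != data.feature_names:
        raise ValueError("feature_names mismatch: {0} {1}".format(
            model_feature_names, data.feature_names))


def try_validate(model_feature_names, data):
    """Return the name of the exception the check raises, or 'ok'."""
    try:
        validate_features(model_feature_names, data)
    except (AttributeError, ValueError) as exc:
        return type(exc).__name__
    return "ok"


trained_on = ["sex", "age", "fare"]  # illustrative names only
results = {
    "frame": try_validate(trained_on, FrameLike(trained_on)),
    "matching": try_validate(trained_on, DMatrixLike(trained_on)),
    "mismatched": try_validate(trained_on, DMatrixLike(["sex", "age"])),
}
print(results)
```

The point of the sketch is that the AttributeError comes from the type of object handed to the check, before any feature names are compared. A genuine name mismatch would surface as the ValueError instead.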

The first thing I would check is that the model you've trained matches the data you are trying to plot. I would double-check everything, including the encoding of feature names. Assert that they match 100% before doing anything with PDP, then fix any problems. If it still fails, reduce the problem and re-evaluate. Try building a model with fewer features and a very small number of observations so that it trains in seconds or milliseconds, then try to get it to work in the same file or in a notebook environment without doing any encoding, decoding, or serialization of models.
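One way to do that name check is sketched below, assuming you can get both sets of names as plain lists. `compare_feature_names` is a hypothetical helper, not part of PDPbox or xgboost, and the names in the example are invented.

```python
def compare_feature_names(model_features, data_columns):
    """Report how two lists of feature names differ.

    Names are converted to str first, so differences in type
    (for example 0 versus "0") do not hide a real match.
    """
    model_features = [str(name) for name in model_features]
    data_columns = [str(name) for name in data_columns]
    return {
        "missing_from_data": [n for n in model_features if n not in data_columns],
        "unexpected_in_data": [n for n in data_columns if n not in model_features],
        "same_order": model_features == data_columns,
    }


# Illustrative names only: "age" and "Age" differ by case.
model_features = ["sex", "age", "fare"]
data_columns = ["sex", "Age", "fare"]

report = compare_feature_names(model_features, data_columns)
print(report)
```

With a pandas DataFrame you would pass `list(df.columns)` as `data_columns`. If both lists in the report are empty and `same_order` is true, the names are not the problem and it is worth looking at the type of object being passed to the model instead.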