TypeError: new_observation must be a numpy.ndarray or pandas.Series or pandas.DataFrame

ModelOriented / DALEX

moDel Agnostic Language for Exploration and eXplanation

https://dalex.drwhy.ai

GNU General Public License v3.0


TypeError: new_observation must be a numpy.ndarray or pandas.Series or pandas.DataFrame #365

Closed ThomasWolf0701 closed 3 years ago

ThomasWolf0701 commented 3 years ago

A dalex explainer only accepts a numpy.ndarray, pandas.Series, or pandas.DataFrame as new_observation.

Normally this would not be a problem, but I ran a speed test with catboost predict and the input type noticeably changes the prediction time. I have not tested other classifiers yet:

With a data frame:

    test = featureMatrix
    timeit.timeit(lambda: bestModel.predict(test)[:,1], number=10)
    0.4796263000007457

With a numpy ndarray:

    test = featureMatrix.to_numpy(np.ndarray)
    timeit.timeit(lambda: bestModel.predict(test)[:,1], number=10)
    4.818123199999718

I can pass both of those. But when I extract the values from the data frame first:

    test = featureMatrix.values
    timeit.timeit(lambda: bestModel.predict(test)[:,1], number=10)
    0.20195280000007187

This last one I can't pass to the dx explainer's predict_parts, and it seems to give more than a twofold speedup over the data frame, on data with around 2000 columns and around 2000 rows.

Does this make any difference?
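One possible reading of the timings above, not confirmed anywhere in this thread: the first positional parameter of DataFrame.to_numpy is dtype, so to_numpy(np.ndarray) asks for an unusual dtype instead of a plain conversion. If NumPy falls back to a generic object dtype in that case (an assumption), the resulting array would behave much like to_numpy(dtype=object), which is typically far slower to process than the float64 array returned by .values. The sketch below only compares those two array kinds on synthetic data. feature_matrix and the row-sum workload are illustrative stand-ins, not the original data or catboost.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the feature matrix from the report (hypothetical data).
rng = np.random.default_rng(0)
feature_matrix = pd.DataFrame(rng.normal(size=(200, 50)))

# .values on an all-float frame returns a float64 array.
as_float = feature_matrix.values

# Explicitly requesting object dtype stores every cell as a Python object.
as_object = feature_matrix.to_numpy(dtype=object)

print(as_float.dtype, as_object.dtype)

# A simple row-wise sum stands in for the model's predict call.
t_float = timeit.timeit(lambda: as_float.sum(axis=1), number=10)
t_object = timeit.timeit(lambda: as_object.sum(axis=1), number=10)
print(f"float64: {t_float:.4f}s, object: {t_object:.4f}s")
```

Printing test.dtype for each of the three variants in the report would show whether this is what happened there.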

hbaniecki commented 3 years ago

dalex operates on pandas; thus, new_observation is converted into a pandas.DataFrame. This affects performance when using frameworks with dedicated data formats (e.g. h2o has H2OFrame and xgboost has DMatrix), but it is a necessary cost of providing a unified interface. I am not sure what you mean by "data frame", or how "numpy ndarray" differs from "matrix.values", but with a code example I would be able to add the native catboost predict function to the dalex package and test it.
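As a rough illustration of the point about conversion, here is a minimal sketch of a custom predict function that unwraps the DataFrame back into a NumPy array before calling the model. It assumes the dalex convention that a predict function takes (model, data) and returns a 1-d array of predictions. StubModel and predict_positive_class are hypothetical names used only for this example, and the stub stands in for a fitted classifier such as catboost.

```python
import numpy as np
import pandas as pd


class StubModel:
    """Hypothetical stand-in for a fitted binary classifier."""

    def predict_proba(self, X):
        X = np.asarray(X, dtype=float)
        # Toy logistic score so the example runs without any ML library.
        p = 1.0 / (1.0 + np.exp(-X.sum(axis=1)))
        return np.column_stack([1.0 - p, p])


def predict_positive_class(model, data):
    """Return the positive-class probability as a 1-d array.

    The data argument is unwrapped to a plain NumPy array first, so the
    model never sees the DataFrame wrapper.
    """
    X = data.to_numpy() if isinstance(data, pd.DataFrame) else np.asarray(data)
    return model.predict_proba(X)[:, 1]


frame = pd.DataFrame({"a": [0.0, 1.0, -1.0], "b": [0.5, 0.5, 0.5]})
preds = predict_positive_class(StubModel(), frame)
print(preds)

# Assumed usage with dalex (not executed here):
# explainer = dx.Explainer(model, X, y, predict_function=predict_positive_class)
```

Whether this recovers the speed seen with .values would need to be measured, since the explainer still builds DataFrames internally.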

hbaniecki commented 3 years ago

I have just checked that catboost works with numpy and pandas input. Also, the Pool class has no method that transforms it into numpy/pandas (it can hold categorical values); thus, implementing the conversion is probably out of scope.

Closing this, since the behaviour described in the issue title is intended in dalex.
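For readers wondering what such a check might look like, below is a hedged sketch of input normalisation that accepts the three supported types and raises the error from the issue title otherwise. This is not dalex's actual implementation. as_frame and its columns parameter are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd


def as_frame(new_observation, columns=None):
    """Normalise an observation to a DataFrame (illustrative only)."""
    if isinstance(new_observation, pd.DataFrame):
        return new_observation
    if isinstance(new_observation, pd.Series):
        # A Series is treated as a single row.
        return new_observation.to_frame().T
    if isinstance(new_observation, np.ndarray):
        # A 1-d array is treated as a single row.
        return pd.DataFrame(np.atleast_2d(new_observation), columns=columns)
    raise TypeError(
        "new_observation must be a numpy.ndarray or pandas.Series "
        "or pandas.DataFrame"
    )


row = as_frame(np.array([1.0, 2.0, 3.0]), columns=["a", "b", "c"])
print(row.shape)
```

Under this sketch a plain Python list would be rejected, so callers would wrap it with np.asarray first.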