Closed ThomasWolf0701 closed 3 years ago
`dalex` operates on `pandas`; thus, `new_observation` is converted into a `pandas.DataFrame`. This affects performance when using frameworks with dedicated data formats (e.g. `h2o` has `H2OFrame` and `xgboost` has `DMatrix`), but it is a necessary cost of providing a unified interface.
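To illustrate the kind of normalization described above, here is a rough sketch of coercing the supported input types into a `pandas.DataFrame`. This is not the actual `dalex` code; the helper name `as_dataframe` and its `columns` argument are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd


def as_dataframe(new_observation, columns):
    """Hypothetical helper: coerce supported inputs to a pandas.DataFrame."""
    if isinstance(new_observation, pd.DataFrame):
        # Already in the unified format, nothing to do.
        return new_observation
    if isinstance(new_observation, pd.Series):
        # Treat a Series as a single observation (one row).
        return new_observation.to_frame().T
    if isinstance(new_observation, np.ndarray):
        # A 1-D array is treated as one row; a 2-D array is kept as-is.
        return pd.DataFrame(np.atleast_2d(new_observation), columns=columns)
    raise TypeError(f"Unsupported input type: {type(new_observation)!r}")


row = as_dataframe(np.array([1.0, 2.0]), columns=["a", "b"])
print(row.shape)
```

Every call pays for this conversion, which is where the overhead relative to a framework's native format comes from.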
I am not sure what you mean by "data frame", or how "numpy ndarray" differs from "matrix.values", but with a code example I would be able to add the native `catboost` predict function to the `dalex` package and test it out.
I have just checked that `catboost` works with `numpy` and `pandas` input. Also, the `Pool` class has no method that transforms it into `numpy`/`pandas` (it can have categorical values); thus, implementing the conversion is probably out of scope.
Closing this, since the title of the issue describes intended `dalex` behaviour.
A dalex explainer takes only numpy.ndarray, pandas.Series, or pandas.DataFrame.
This would normally not be a problem, but I did a speed test with catboost predict and it actually slows the prediction down. I have not tested with other classifiers yet:
With a data frame:
test = featureMatrix
timeit.timeit(lambda: bestModel.predict(test)[:,1], number=10)
0.4796263000007457

With a numpy ndarray:
test = featureMatrix.to_numpy(np.ndarray)
timeit.timeit(lambda: bestModel.predict(test)[:,1], number=10)
4.818123199999718
Those I can pass, but when extracting the values from the data frame first:
test = featureMatrix.values
timeit.timeit(lambda: bestModel.predict(test)[:,1], number=10)
0.20195280000007187
And the last one I can't pass to the dx explainer predict parts, and it seems to give more than a two-times speed-up on a data frame with around 2000 columns and around 2000 rows.
Does this make any difference?
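The comparison above can be reproduced with a small harness like the following sketch. The model is replaced by a hypothetical `stub_predict` (a plain dot product) and the data is random, so the absolute timings will not match the catboost numbers reported here; the dtype of each array is printed as well, since that is worth checking when array inputs differ in speed.

```python
import timeit

import numpy as np
import pandas as pd


def time_predict(predict, inputs, number=10):
    """Time predict(x) for every named input and return seconds per input name."""
    return {
        name: timeit.timeit(lambda x=x: predict(x), number=number)
        for name, x in inputs.items()
    }


rng = np.random.default_rng(0)
feature_matrix = pd.DataFrame(rng.normal(size=(200, 50)))  # synthetic data
weights = rng.normal(size=50)


def stub_predict(X):
    # Hypothetical stand-in for bestModel.predict; replace with the real model.
    return np.asarray(X) @ weights


inputs = {
    "DataFrame": feature_matrix,
    ".values": feature_matrix.values,
    ".to_numpy()": feature_matrix.to_numpy(),
}

timings = time_predict(stub_predict, inputs)
for name, seconds in timings.items():
    dtype = getattr(inputs[name], "dtype", "n/a (DataFrame)")
    print(f"{name}: {seconds:.6f} s, dtype={dtype}")
```

Note that the first positional argument of `DataFrame.to_numpy` is `dtype`, so `to_numpy(np.ndarray)` passes `np.ndarray` as the requested dtype; this may produce a different array than `.values` or a bare `.to_numpy()`, which could be part of the gap.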