
blog/speeding-up-sklearn-single-predictions/ #1

Open utterances-bot opened 4 years ago

utterances-bot commented 4 years ago

Speeding up scikit-learn for single predictions - Max Halford

It is now common practice to train machine learning models offline before putting them behind an API endpoint to serve predictions. Specifically, we want an API route which can make a prediction for a single row/instance/sample/data point/individual (call it what you want). Nowadays, we have great tools that take care of the nitty-gritty details, such as Cortex, MLflow, Kubeflow, and Clipper. There are also paid services that hold your hand a bit more, such as DataRobot, H2O, and Cubonacci.

https://maxhalford.github.io/blog/speeding-up-sklearn-single-predictions/

remiadon commented 4 years ago

Hi Max, good article! Agreed, most of the PyData stack was designed for batch processing from the start. I think it's actually good to rethink what we consider the de facto way of doing machine learning.

You might want to have a look at sklearn-ONNX by Xavier Dupré, which I mentioned in this article.
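For a flavour of what that route looks like, here is a minimal sketch, assuming a fitted scikit-learn model `model` and a float feature matrix `X` (both names are illustrative, not from the article):

```python
import numpy as np
import onnxruntime as rt
from skl2onnx import to_onnx

# convert the fitted model, inferring the input signature from a sample row
onx = to_onnx(model, X[:1].astype(np.float32))

# serve single predictions through onnxruntime instead of scikit-learn
sess = rt.InferenceSession(onx.SerializeToString(), providers=['CPUExecutionProvider'])
input_name = sess.get_inputs()[0].name
pred = sess.run(None, {input_name: X[:1].astype(np.float32)})[0]
```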

MaxHalford commented 4 years ago

Hey Rémi. Thanks for the reference, I'll definitely check it out. I might even benchmark it against a pure-Python equivalent such as creme, as well as TensorFlow Serving.

pratikchhapolika commented 1 year ago

Hi Max, I have a similar problem with multi-class (150 classes) text classification. In real time, predicting a single example is very slow.

```python
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsOneClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def create_pipe(clf):
    # TF-IDF on the 'Message' column, drop every other column
    column_trans = ColumnTransformer([('Text', TfidfVectorizer(), 'Message')], remainder='drop')
    pipeline = Pipeline([('prep', column_trans), ('clf', clf)])
    return pipeline

def fit_and_print(pipeline):
    # X_train, y_train, X_test, y_test and the label encoder le come from the surrounding setup
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    print(metrics.classification_report(y_test, y_pred, target_names=le.classes_, digits=3))

clf = OneVsOneClassifier(LinearSVC(random_state=42, class_weight='balanced'))
pipeline = create_pipe(clf)
fit_and_print(pipeline)
```

Predicting on a single example:

```python
import pandas as pd

def create_test_data(x):
    # wrap a single message in a one-row DataFrame, as the pipeline expects
    d = {'Message': x}
    df = pd.DataFrame(d, index=[0])
    return df

# test and labels come from the surrounding setup
revs = []
for idx in [948, 5717, 458]:
    cur = test.loc[idx, 'UTTERANCE']
    revs.append(cur)

for rev in revs:
    c_res = pipeline.predict(create_test_data(rev))
    print(rev, '=', labels[c_res[0]])
```

How can I optimize it?

MaxHalford commented 1 year ago

@pratikchhapolika with so many classes you should try using a OneVsRestClassifier; it'll be faster than OneVsOneClassifier. Also, maybe scikit-learn-intelex will give you a boost.
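As a minimal sketch, reusing your `create_pipe` and `fit_and_print` from above: OneVsRestClassifier trains one binary classifier per class (150 here), whereas OneVsOneClassifier trains one per pair of classes (150 × 149 / 2 = 11175), so a single prediction has far fewer models to evaluate.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# one binary classifier per class instead of one per pair of classes
clf = OneVsRestClassifier(LinearSVC(random_state=42, class_weight='balanced'))
pipeline = create_pipe(clf)
fit_and_print(pipeline)
```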

pratikchhapolika commented 1 year ago

> @pratikchhapolika with so many classes you should try using a OneVsRestClassifier; it'll be faster than OneVsOneClassifier. Also, maybe scikit-learn-intelex will give you a boost.

The model is already in production. Is there a hack that works with the current setup?

MaxHalford commented 1 year ago

Sorry but nothing comes to mind 🤷

pratikchhapolika commented 1 year ago

> Sorry but nothing comes to mind 🤷

Could something from your approach work here?

MaxHalford commented 1 year ago

Well sure, you could use the approach I suggest in my article. You would have to dive into OneVsOneClassifier and LinearSVC's predict methods to see if there is any overhead to shave off. But I'm doubtful, and my gut feeling is that OneVsOneClassifier is very expensive with 150 classes in any case.
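For instance, here is a minimal sketch of that idea applied to your pipeline: pull the fitted TfidfVectorizer and classifier out once, then call them directly, which skips the one-row DataFrame construction and the ColumnTransformer dispatch that pipeline.predict repeats on every call. The OneVsOneClassifier voting itself stays untouched, which is why I'm doubtful it'll be enough.

```python
# extract the fitted steps once, outside the hot path
vectorizer = pipeline.named_steps['prep'].named_transformers_['Text']
clf = pipeline.named_steps['clf']

def fast_predict(message):
    # transform a single string directly, no one-row DataFrame needed
    x = vectorizer.transform([message])
    return clf.predict(x)[0]
```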

You could also convert the model to a pure-Python function, like I did in naked. But that requires a bespoke implementation for each scikit-learn estimator, and alas I haven't written ones for OneVsOneClassifier, LinearSVC, and ColumnTransformer. You could contribute them though :)
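To give a flavour of what such a conversion looks like, here is a hand-written sketch (not naked's actual output) of a binary LinearSVC prediction, assuming a fitted model `svc` and a dense feature vector `x` given as a plain list of floats:

```python
def linear_svc_predict(x, coef, intercept, classes):
    # the decision function is just a dot product plus an intercept
    score = sum(w * xi for w, xi in zip(coef, x)) + intercept
    return classes[1] if score > 0 else classes[0]

# parameters extracted once from the fitted scikit-learn model
coef = svc.coef_[0].tolist()
intercept = float(svc.intercept_[0])
classes = svc.classes_.tolist()
```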