utterances-bot opened 4 years ago
Hi Max, good article! Agreed, most of the PyData stack was designed for batch processing from the start. I think it's actually good to rethink what we consider the de facto way of doing machine learning.
You might want to have a look at sklearn-ONNX by Xavier Dupré, which I mentioned in this article.
Hey Rémi. Thanks for the reference, I'll definitely check it out. I might even benchmark it against a pure-Python equivalent such as creme, as well as TensorFlow Serving.
Hi Max, I have a similar issue with multi-class (150 classes) text classification: in real time, predicting a single example is very slow.
import pandas as pd
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsOneClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def create_pipe(clf):
    column_trans = ColumnTransformer([('Text', TfidfVectorizer(), 'Message')], remainder='drop')
    pipeline = Pipeline([('prep', column_trans), ('clf', clf)])
    return pipeline

def fit_and_print(pipeline):
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    print(metrics.classification_report(y_test, y_pred, target_names=le.classes_, digits=3))

clf = OneVsOneClassifier(LinearSVC(random_state=42, class_weight='balanced'))
pipeline = create_pipe(clf)
fit_and_print(pipeline)
Predicting on a single example:
def create_test_data(x):
    d = {'Message': x}
    df = pd.DataFrame(d, index=[0])
    return df

revs = []
for idx in [948, 5717, 458]:
    cur = test.loc[idx, 'UTTERANCE']
    revs.append(cur)

for rev in revs:
    c_res = pipeline.predict(create_test_data(rev))
    print(rev, '=', labels[c_res[0]])
How can I optimize it?
@pratikchhapolika you should try using a OneVsRestClassifier; with so many classes, it'll be faster than OneVsOneClassifier. Also maybe scikit-learn-intelex will give you a boost.
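A minimal sketch of the suggested swap (the data here is synthetic, purely to make the snippet self-contained): one-vs-one fits k(k-1)/2 binary classifiers, so at 150 classes that's 11,175 models, whereas one-vs-rest only needs 150.

```python
# Sketch: OneVsRestClassifier fits one binary classifier per class,
# instead of one per *pair* of classes as OneVsOneClassifier does.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real text features, just for illustration.
X, y = make_classification(n_samples=600, n_features=20, n_informative=15,
                           n_classes=10, random_state=42)

clf = OneVsRestClassifier(LinearSVC(random_state=42, class_weight='balanced'))
clf.fit(X, y)

print(len(clf.estimators_))  # one binary classifier per class: 10
print(150 * 149 // 2)        # one-vs-one at 150 classes would need 11175
```

At prediction time the same ratio applies: one-vs-rest evaluates 150 decision functions per example instead of 11,175.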
The model is in production and is there a hack with this current setup only?
Sorry but nothing comes to mind 🤷
What about something using your approach?
Well sure, you could use the approach I suggest in my article. You would have to dive into the predict methods of OneVsOneClassifier and LinearSVC to see if there's any overhead to shave off. But I'm doubtful, and my gut feeling is that OneVsOneClassifier is very expensive with 150 classes in any case.
You could also convert the model to a pure Python function, like I did in naked. But that requires a bespoke implementation for each scikit-learn estimator. Alas, I haven't written implementations for OneVsOneClassifier, LinearSVC, and ColumnTransformer. You could contribute them though :)
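To illustrate the idea behind that conversion, here is a hedged sketch (synthetic data, illustrative names, and a plain LinearSVC rather than the one-vs-one wrapper): a fitted linear model reduces to a coefficient matrix and an intercept vector, so a single prediction becomes one dot product and an argmax, with no Pipeline or ColumnTransformer machinery in the way.

```python
# Sketch: reduce a fitted LinearSVC to plain NumPy for fast single
# predictions. This mirrors the spirit of the "naked" idea, not its API.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in data, just to make the example runnable.
X, y = make_classification(n_samples=300, n_features=10, n_informative=8,
                           n_classes=3, random_state=42)
svc = LinearSVC(random_state=42).fit(X, y)

# Everything needed at prediction time, extracted once after fitting.
coef, intercept, classes = svc.coef_, svc.intercept_, svc.classes_

def predict_one(x):
    # One-vs-rest decision scores for a single row, then pick the best class.
    scores = coef @ x + intercept
    return classes[np.argmax(scores)]

# Sanity check against scikit-learn's own prediction.
assert predict_one(X[0]) == svc.predict(X[0].reshape(1, -1))[0]
```

For the pipeline above you would also need to reduce the TfidfVectorizer to a vocabulary lookup plus IDF weights, and replicate OneVsOneClassifier's pairwise voting, which is exactly the bespoke per-estimator work mentioned.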
Speeding up scikit-learn for single predictions - Max Halford
It is now common practice to train machine learning models offline before putting them behind an API endpoint to serve predictions. Specifically, we want an API route which can make a prediction for a single row/instance/sample/data point/individual (call it what you want). Nowadays, we have great tools that take care of the nitty-gritty details, such as Cortex, MLFlow, Kubeflow, and Clipper. There are also paid services that hold your hand a bit more, such as DataRobot, H2O, and Cubonacci.
https://maxhalford.github.io/blog/speeding-up-sklearn-single-predictions/