Can anyone please guide for my implementation of flask API server with sentence-transformers model?

UKPLab / sentence-transformers

State-of-the-Art Text Embeddings

Apache License 2.0

15.45k stars, 2.5k forks

👋🏻 Hello, I know it might sound a little silly to ask, but I am working on a project in which I use flask as the API server and use sentence-transformers/all-MiniLM-L6-v2 as the model for the similarity check.

👨🏻‍💻 Structure

At a high level, I have the following structure:

app.py

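from flask import Flask, jsonify

import functions

app = Flask(__name__)

@app.route("/api/match")
def match():
    # "QUERY" stands in for the text taken from the incoming request
    top_3_matches = functions.match("QUERY")
    return jsonify(top_3_matches)  # assumed: the matches are JSON-serializable

### other code ###

def main():
    app.run(host="0.0.0.0", threaded=False, debug=True, use_reloader=False)

if __name__ == "__main__":
    main()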

And in functions.py:

from sentence_transformers import SentenceTransformer, util

sentence_transformers_model = SentenceTransformer("all-MiniLM-L6-v2")

def match(query):
    ### match code ###
    return top_3

🤔 My question

1. Is my implementation "okay"?
2. Will it work for parallel processing? Or will it create separate models and allocate separate resources for each new parallel request?
3. Can it be optimized? I need to serve each new request as it comes; I don't want to create "batches" and process them all at once.

Currently it works just fine with multiple users, but I would like to know whether this is the standard approach or whether anything needs to be changed.

Please guide me on this. Thank you 🙏🏻

Hello!

Although I've used it quite a bit, I'm no flask expert by any means, so take my advice with a grain of salt.

1. It looks okay to me.
2. That depends on the Flask and WSGI server (e.g. gunicorn) configuration. I believe that with threaded=False, requests might just be handled sequentially. In practice, you'll run a Flask app with e.g. gunicorn and some number of workers. For each worker, the model would be initialized fresh. This might cause memory issues. Look for recommendations here. This SO post is also useful.
3. Perhaps you can use ONNX to speed up processing, but it might be more hassle than it's worth. There's documentation on that here. Other than that, I'm not experienced enough with Flask & gunicorn to be able to suggest other optimizations.

Tom Aarsen
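
To make the "one model per worker, not one per request" point concrete, here is a minimal sketch using only the standard library. It is not taken from the thread: SharedModel and fake_loader are hypothetical names, and fake_loader stands in for the expensive SentenceTransformer("all-MiniLM-L6-v2") call so the example runs without downloading anything. The threads play the role of concurrent requests inside a single process.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedModel:
    """Hypothetical helper: holds one model per process, loaded on first use."""

    def __init__(self, loader):
        self._loader = loader          # callable that builds the model
        self._model = None
        self._lock = threading.Lock()
        self.load_count = 0            # only here to make the behaviour visible

    def get(self):
        # Double-checked locking: take the lock only while the model is missing.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._loader()
                    self.load_count += 1
        return self._model


def fake_loader():
    # Stand-in (assumption) for SentenceTransformer("all-MiniLM-L6-v2").
    return object()


shared = SharedModel(fake_loader)

# Simulate eight concurrent requests served by threads in the same process.
with ThreadPoolExecutor(max_workers=8) as pool:
    models = list(pool.map(lambda _: shared.get(), range(8)))

print(shared.load_count)  # number of times the loader actually ran
```

Every simulated request gets the same object back, and the loader runs once. Loading the model at module level, as functions.py does, has the same effect: it happens once when the process imports the module. With the real library you would pass something like lambda: SentenceTransformer("all-MiniLM-L6-v2") as the loader. Under gunicorn with several worker processes, each worker would still hold its own copy, so memory use grows roughly with the number of workers.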

Can anyone please guide for my implementation of flask API server with sentence-transformers model? #2409
