UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Can anyone please guide me on my implementation of a Flask API server with a sentence-transformers model? #2409

Open AayushSameerShah opened 7 months ago

AayushSameerShah commented 7 months ago

👋🏻 Hello, I know it might sound a little silly to ask, but I am working on a project in which I use Flask as the API server and sentence-transformers/all-MiniLM-L6-v2 as the model for the similarity check.

👨🏻‍💻 Structure

At a high level, I have the following structure:

app.py

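from flask import Flask

import functions  # functions.py, shown below; loads the model at import time

app = Flask(__name__)

@app.route("/api/match")
def match():
    top_3_matches = functions.match("QUERY")

### other code ###

def main():
    # threaded=False: the development server handles one request at a time
    app.run(host="0.0.0.0", threaded=False, debug=True, use_reloader=False)

if __name__ == "__main__":
    main()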

And in functions.py:

from sentence_transformers import SentenceTransformer, util

sentence_transformers_model = SentenceTransformer("all-MiniLM-L6-v2")

def match(query):
    ### match code ###
    return top_3

🤔 My question

Currently it works just fine with multiple users, but I would like to know whether this is the standard approach or whether anything needs to be changed.

Please guide me on this. Thank you 🙏🏻

tomaarsen commented 7 months ago

Hello!

Although I've used it quite a bit, I'm no Flask expert by any means, so take my advice with a grain of salt.

  1. It looks okay to me.
  2. That depends on the Flask and WSGI server (e.g. gunicorn) configuration. I believe that with threaded=False, requests might just be handled sequentially. In practice, you'll run a Flask app with e.g. gunicorn and some number of workers. Each worker is a separate process, so the model would be initialized fresh in each one. This might cause memory issues. Look for recommendations here. This SO post is also useful.
  3. Perhaps you can use ONNX to speed up processing, but it might be more hassle than it's worth. There's documentation on that here. Other than that, I'm not experienced enough with Flask & gunicorn to be able to suggest other optimizations.
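
To make point 2 a bit more concrete, here is a rough sketch of what a gunicorn configuration file could look like. The file name, port, worker count, and timeout are all assumptions for illustration, not tuned recommendations, and I'm assuming CPU inference. `preload_app` loads the app (and so the model) once before the workers are forked, which may reduce memory use, although how much is actually shared between workers in practice is something you'd have to measure.

```python
# gunicorn.conf.py (hypothetical file name; a sketch, not a tuned configuration)

# Address and port to listen on (5000 mirrors Flask's default port).
bind = "0.0.0.0:5000"

# Each worker is a separate process with its own copy of the model,
# so memory use grows roughly linearly with this number.
workers = 2

# Threads per worker. With 1, each worker handles requests one at a time,
# similar to threaded=False in app.run().
threads = 1

# Import the app (and therefore load the model) once in the master process
# before forking. Assumption: CPU inference; CUDA state does not survive a fork.
preload_app = True

# Seconds before an unresponsive worker is killed and restarted.
timeout = 60
```

With the app.py above exposing `app`, this would be started with something like `gunicorn -c gunicorn.conf.py app:app` instead of calling `app.run()`.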