UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
14.84k stars 2.44k forks

Fastapi gunicorn multiple worker deployment #2832

Open AnGrypng opened 2 months ago

AnGrypng commented 2 months ago

Hey everybody,

I want to deploy a sentence encoding model using sentence-transformers. My code looks something like this:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Dict
from sentence_transformers import SentenceTransformer, util
import os
import pandas as pd
import time
import torch
torch.set_num_threads(1)  # limit PyTorch to one intra-op thread per process
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # hide all GPUs so inference stays on CPU

app = FastAPI()

class Payload(BaseModel):
    images: Dict[str, str]
    keywords: str

try:
    device = torch.device('cpu')
    model = SentenceTransformer(os.getenv("MODEL_PATH", "model/"), device=device, trust_remote_code=True)
except Exception as e:
    print(f"Error: {e}")
    raise  # without a loaded model, `model` is undefined and every request would fail

@app.post("/rank_sentences")
def rank_sentences(payload: Payload):
    try:
        start_time = time.time()
        keywords = payload.keywords
        sentences = list(payload.images.values())

        sentence_embeddings = model.encode(sentences)

I use gunicorn with this config:

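worker_class = "uvicorn.workers.UvicornWorker"  # ASGI worker so gunicorn can serve FastAPI
workers = 2  # number of worker processes
threads = 1  # only affects the gthread worker type
timeout = 60  # seconds a worker may stay silent before it is restarted
bind = '0.0.0.0:3000'
loglevel = 'debug'
accesslog = '-'  # write the access log to stdout
errorlog = '-'  # write the error log to stderr
capture_output = True  # redirect stdout/stderr into the error log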

The problem I face is that even though I set gunicorn to two workers, the model does not seem to run in parallel or concurrently. I expected that with two workers the application would be loaded into memory twice and two encodings would run at the same time, but the timings I measure don't show that.
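One way to check whether two requests actually overlap is to compare the wall-clock time of a concurrent batch against the sum of the individual call times. Below is a minimal sketch using only the standard library. `measure_overlap` and `fake_encode` are hypothetical names, and `fake_encode` only sleeps in place of a real HTTP call to `/rank_sentences`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_overlap(call, n_requests=2):
    """Run `call` n_requests times concurrently.

    Returns (wall_seconds, summed_seconds). If the calls truly overlap,
    wall_seconds is close to the slowest single call; if they are
    serialized, it is close to summed_seconds.
    """
    def timed(_):
        start = time.perf_counter()
        call()
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        durations = list(pool.map(timed, range(n_requests)))
    wall = time.perf_counter() - wall_start
    return wall, sum(durations)


def fake_encode():
    # Hypothetical stand-in for one POST to /rank_sentences.
    time.sleep(0.2)


if __name__ == "__main__":
    wall, total = measure_overlap(fake_encode, n_requests=2)
    print(f"wall={wall:.2f}s  sum of calls={total:.2f}s")
```

When pointing this at a real server, `fake_encode` would be replaced by a function that sends one request. Opening a fresh connection per call is probably safer for this test, since requests sent over one kept-alive connection are handled by the worker that accepted it.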

I don't think the problem is RAM or CPU cores, as I have plenty of both.
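Whether those cores are actually usable by the server processes can be checked from inside a process, because an affinity mask or cpuset can expose fewer cores than the machine has. Below is a minimal sketch, assuming Linux for the affinity call. `usable_cpus` is a hypothetical helper that falls back to `os.cpu_count()` on other platforms. It does not detect CPU time quotas set on a container.

```python
import os


def usable_cpus():
    """Return the number of CPUs this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux only: respects affinity masks and cpusets.
        return len(os.sched_getaffinity(0))
    # Fallback: total CPUs reported by the OS (may be None on some platforms).
    return os.cpu_count() or 1


if __name__ == "__main__":
    print(f"pid={os.getpid()} usable_cpus={usable_cpus()} cpu_count={os.cpu_count()}")
```

Calling this once at startup in each worker, alongside `torch.get_num_threads()`, would show how many cores and threads each process really has available.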

Perhaps you have some experience with deploying them to production. Perhaps I'm missing some parameters.

I'd be very grateful for any help.

tomaarsen commented 1 month ago

Hello!

Apologies for the delay, I've been recovering from surgery this last month. I'm not very familiar with gunicorn & FastAPI, so I'm not sure how best to approach this. That said, I'm aware of the https://github.com/michaelfeil/infinity project, which also uses gunicorn and FastAPI. It might act as inspiration, or perhaps you can use it directly. I believe right now it's 3 projects in one (?), but the infinity_emb here uses FastAPI and gunicorn.