FastAPI gunicorn multiple worker deployment

Hey everybody,

I want to deploy a sentence encoding model using sentence-transformers. My code looks something like this:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Dict
from sentence_transformers import SentenceTransformer, util
import os
import pandas as pd
import time
import torch
torch.set_num_threads(1)
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

app = FastAPI()

class Payload(BaseModel):
    images: Dict[str, str]
    keywords: str

try:
    device = torch.device('cpu')
    model = SentenceTransformer(os.getenv("MODEL_PATH", "model/"), device=device, trust_remote_code=True)
except Exception as e:
    print(f"Error: {e}")

@app.post("/rank_sentences")
def rank_sentences(payload: Payload):
    try:
        start_time = time.time()
        keywords = payload.keywords
        sentences = list(payload.images.values())

        sentence_embeddings = model.encode(sentences)

I run it with gunicorn, configured like this:

worker_class = "uvicorn.workers.UvicornWorker"
workers = 2
threads = 1
timeout = 60
bind = '0.0.0.0:3000'
loglevel = 'debug'
accesslog = '-'
errorlog = '-'
capture_output = True

The problem is that even though I configured gunicorn with two workers, the model does not seem to run in parallel or concurrently. I thought that with two workers the FastAPI application would be loaded into memory twice and two encodings would run at the same time, but the timings I measure do not show this.
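
To make the timing comparison concrete, here is a minimal sketch of how sequential and concurrent latency could be measured from the client side. The helper names `compare_timings` and `fake_request` are made up for illustration and are not part of the code above. `fake_request` only sleeps; it is a stand-in that would be replaced by a real HTTP POST to `/rank_sentences`.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def compare_timings(call: Callable[[], object], n_requests: int = 4) -> Tuple[float, float]:
    """Return (sequential_seconds, concurrent_seconds) for n_requests calls."""
    # Issue the calls one after another.
    start = time.perf_counter()
    for _ in range(n_requests):
        call()
    sequential = time.perf_counter() - start

    # Issue the same number of calls at once, one client thread per call.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        futures = [pool.submit(call) for _ in range(n_requests)]
        for future in futures:
            future.result()  # re-raise any error raised inside the call
    concurrent = time.perf_counter() - start
    return sequential, concurrent


def fake_request() -> None:
    # Hypothetical stand-in for a POST to /rank_sentences (assumption):
    # swap in a real HTTP client call when testing against the server.
    time.sleep(0.05)


if __name__ == "__main__":
    seq, conc = compare_timings(fake_request, n_requests=4)
    print(f"sequential: {seq:.3f}s, concurrent: {conc:.3f}s")
```

If both workers really encode in parallel, the concurrent time for two or more requests should be clearly below the sequential time. If the two numbers are about equal, the requests are probably being serialized somewhere between the client and the model.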

I don't think RAM or CPU cores are the bottleneck, as the machine has plenty of both.
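
One way to double-check the CPU side is to print how many cores each worker process is actually allowed to use, which can be lower than the machine total inside a container or under CPU pinning. This is only a sketch: `usable_cpu_count` is a hypothetical helper, `os.sched_getaffinity` exists only on some platforms such as Linux, and CPU quotas set through cgroups are not reflected in this number.

```python
import os


def usable_cpu_count() -> int:
    """Number of CPU cores the current process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: respects CPU pinning (taskset, cpusets), but not cgroup quotas.
        return len(os.sched_getaffinity(0))
    # Fallback for platforms without sched_getaffinity.
    return os.cpu_count() or 1


if __name__ == "__main__":
    print(f"pid={os.getpid()} total_cpus={os.cpu_count()} usable_cpus={usable_cpu_count()}")
```

Calling this once at import time in the application module would print one line per gunicorn worker, which also shows whether two separate worker processes were started.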

Perhaps someone here has experience deploying these models to production, or perhaps I'm missing some parameters.

Thanks in advance; I'd be very grateful for any help.

UKPLab / sentence-transformers

Fastapi gunicorn multiple worker deployment #2832