FrancescoCasalegno opened this issue 1 year ago
Seldon-core seems to be the most recommended tool to deploy ML models in Kubernetes (first Google result and 3.3k+ stars on GitHub). https://www.datarevenue.com/en-blog/why-you-need-a-model-serving-tool-such-as-seldon
Other options are available: https://medium.com/everything-full-stack/machine-learning-model-serving-overview-c01a6aa3e823
Deploy the model as a Flask app: https://opensource.com/article/20/9/deep-learning-model-kubernetes Or using FastAPI (better than Flask!?): https://betterprogramming.pub/3-reasons-to-switch-to-fastapi-f9c788d017e5
BentoML / Yatai: https://github.com/bentoml/BentoML (3.9k+ stars) https://github.com/bentoml/Yatai (300+ stars)
Flask and FastAPI might not be a good solution, as they do not scale well and might have performance issues. I'm currently testing Seldon and Yatai.
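For reference, the "wrap the model in a small web app" approach from the Flask article boils down to something like the sketch below. This is a stdlib-only illustration (no Flask/FastAPI dependency), and `fake_embed` is a hypothetical stand-in for the real sentence-transformers model; a real Flask app would look similar with route decorators instead of a handler class. Note it runs a single model instance per process, which is exactly why this approach needs extra work (workers, replicas) to scale:

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def fake_embed(text):
    # Hypothetical stand-in for the real sentence-transformers model;
    # returns a dummy 3-dimensional vector.
    return [float(len(text)), 0.0, 1.0]


class EmbedHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON request body, e.g. {"text": "..."}.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        vector = fake_embed(payload.get("text", ""))

        # Return the embedding as JSON.
        body = json.dumps({"embedding": vector}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


if __name__ == "__main__":
    # Bind to an ephemeral port and serve until interrupted.
    server = ThreadingHTTPServer(("127.0.0.1", 0), EmbedHandler)
    print(f"listening on port {server.server_address[1]}")
    server.serve_forever()
```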
The default model for sentence embedding was deployed on a local Seldon server using the configuration below:
```yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: minilm
  namespace: seldon
spec:
  protocol: v2
  predictors:
    - graph:
        name: transformer
        implementation: HUGGINGFACE_SERVER
        parameters:
          - name: task
            type: STRING
            value: feature-extraction
          - name: pretrained_model
            type: STRING
            value: sentence-transformers/multi-qa-MiniLM-L6-cos-v1
      name: default
      replicas: 1
```
A request to the model can be sent using the `bluesearch.k8s.embeddings.embed_seldon` function.
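Since the deployment uses `protocol: v2` (the Open Inference / KServe v2 protocol), a raw request can also be sketched with the standard library alone. The ingress URL, and the input name `"args"` (which is what the MLServer HuggingFace runtime typically expects), are assumptions to adapt to the actual cluster:

```python
import json
import urllib.request


def build_v2_request(texts):
    """Build an Open Inference (v2) protocol payload.

    The input name "args" is an assumption based on the MLServer
    HuggingFace runtime; adjust it if your runtime expects otherwise.
    """
    return {
        "inputs": [
            {
                "name": "args",
                "shape": [len(texts)],
                "datatype": "BYTES",
                "data": list(texts),
            }
        ]
    }


def embed(texts, base_url="http://localhost:8080"):
    """POST the payload to the deployment.

    The path follows Seldon's ingress convention
    /seldon/<namespace>/<deployment>/v2/models/<graph-node>/infer;
    base_url is a placeholder for the actual ingress address.
    """
    url = f"{base_url}/seldon/seldon/minilm/v2/models/transformer/infer"
    req = urllib.request.Request(
        url,
        data=json.dumps(build_v2_request(texts)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return body["outputs"][0]["data"]
```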
The average response time is 74 ± 70 ms.
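For reproducibility, a mean ± standard-deviation figure like this can be obtained with a small timing harness such as the sketch below (the callable passed in would be the actual request function, e.g. the embedding call above):

```python
import statistics
import time


def time_requests(call, n=100):
    """Time n invocations of `call` and return (mean, sample std-dev) in ms."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.mean(samples), statistics.stdev(samples)
```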
Context
A sentence-transformer embedding model is deployed on Kubernetes, to be able to scale and avoid downtime when users make their queries.
Actions