FrancescoCasalegno opened this issue 1 year ago
Seldon-core seems to be the most recommended tool to deploy ML models in Kubernetes (first Google result and 3.3k+ stars on GitHub). https://www.datarevenue.com/en-blog/why-you-need-a-model-serving-tool-such-as-seldon
Other options are available: https://medium.com/everything-full-stack/machine-learning-model-serving-overview-c01a6aa3e823
Deploy the model as a Flask app: https://opensource.com/article/20/9/deep-learning-model-kubernetes Or using FastAPI (better than Flask!?): https://betterprogramming.pub/3-reasons-to-switch-to-fastapi-f9c788d017e5
BentoML / Yatai: https://github.com/bentoml/BentoML (3.9k+ stars) https://github.com/bentoml/Yatai (300+ stars)
Flask and FastAPI might not be a good solution, as they do not scale well and might have performance issues. I'm currently testing Seldon and Yatai.
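For reference, the "wrap the model in a small web app" approach from the Flask article boils down to something like the sketch below. This is a stdlib-only illustration (no Flask/FastAPI dependency), and `fake_embed` is a hypothetical stand-in for the real sentence-transformers model; a real Flask app would look similar with route decorators instead of a handler class. Note it runs a single model instance per process, which is exactly why this approach needs extra work (workers, replicas) to scale:

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def fake_embed(text):
    # Hypothetical stand-in for the real sentence-transformers model;
    # returns a dummy 3-dimensional vector.
    return [float(len(text)), 0.0, 1.0]


class EmbedHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON request body, e.g. {"text": "..."}.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        vector = fake_embed(payload.get("text", ""))

        # Return the embedding as JSON.
        body = json.dumps({"embedding": vector}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


if __name__ == "__main__":
    # Bind to an ephemeral port and serve until interrupted.
    server = ThreadingHTTPServer(("127.0.0.1", 0), EmbedHandler)
    print(f"listening on port {server.server_address[1]}")
    server.serve_forever()
```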
The default model for sentence embedding was deployed on a local Seldon server using the configuration below:
```yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: minilm
  namespace: seldon
spec:
  protocol: v2
  predictors:
    - graph:
        name: transformer
        implementation: HUGGINGFACE_SERVER
        parameters:
          - name: task
            type: STRING
            value: feature-extraction
          - name: pretrained_model
            type: STRING
            value: sentence-transformers/multi-qa-MiniLM-L6-cos-v1
      name: default
      replicas: 1
```
A request to the model can be sent using the `bluesearch.k8s.embeddings.embed_seldon` function.
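Since the deployment uses `protocol: v2` (the Open Inference / KServe v2 protocol), a raw request can also be sketched with the standard library alone. The ingress URL, and the input name `"args"` (which is what the MLServer HuggingFace runtime typically expects), are assumptions to adapt to the actual cluster:

```python
import json
import urllib.request


def build_v2_request(texts):
    """Build an Open Inference (v2) protocol payload.

    The input name "args" is an assumption based on the MLServer
    HuggingFace runtime; adjust it if your runtime expects otherwise.
    """
    return {
        "inputs": [
            {
                "name": "args",
                "shape": [len(texts)],
                "datatype": "BYTES",
                "data": list(texts),
            }
        ]
    }


def embed(texts, base_url="http://localhost:8080"):
    """POST the payload to the deployment.

    The path follows Seldon's ingress convention
    /seldon/<namespace>/<deployment>/v2/models/<graph-node>/infer;
    base_url is a placeholder for the actual ingress address.
    """
    url = f"{base_url}/seldon/seldon/minilm/v2/models/transformer/infer"
    req = urllib.request.Request(
        url,
        data=json.dumps(build_v2_request(texts)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return body["outputs"][0]["data"]
```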
The average response time is 74 ± 70 ms.
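For reproducibility, a mean ± standard-deviation figure like this can be obtained with a small timing harness such as the sketch below (the callable passed in would be the actual request function, e.g. the embedding call above):

```python
import statistics
import time


def time_requests(call, n=100):
    """Time n invocations of `call` and return (mean, sample std-dev) in ms."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.mean(samples), statistics.stdev(samples)
```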
Context
A sentence-transformer embedding model is deployed on Kubernetes, to be able to scale and avoid downtime when users make their queries.
Actions