ggerganov / llama.cpp

LLM inference in C/C++
MIT License

kubernetes example #6546

Open phymbert opened 1 month ago

phymbert commented 1 month ago

Motivation

Kubernetes is widely used in the industry to deploy products and applications at scale.

It would be useful for the community to have a llama.cpp Helm chart for the server.

I started this several weeks ago and will continue when I have more time; meanwhile, any help is welcome:

https://github.com/phymbert/llama.cpp/tree/example/kubernetes/examples/kubernetes


OmegAshEnr01n commented 1 month ago

Hi! I will take this up!

phymbert commented 1 month ago

Great @OmegAshEnr01n, a few notes:

Ping here if you have questions. Good luck! Excited to use it.

phymbert commented 1 month ago

Hi @OmegAshEnr01n, are you still working on this issue?

OmegAshEnr01n commented 1 month ago

Yes, still am. Will share a pull request over the weekend when completed.

OmegAshEnr01n commented 3 weeks ago

Hi @phymbert

What is the architectural reason for having the embeddings live in a separate deployment from the model? Requiring that would mean we would need to make changes to the HTTP server. Instead, we could have an architecture where the model and embeddings are tightly coupled. Something like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      # one container per entry in .Values.containers, each with its own image
      {{- range $i, $container := .Values.containers }}
      - name: my-container-{{ $i }}
        image: {{ $container.image }}
        volumeMounts:
        - name: data-volume-{{ $i }}
          mountPath: /data
      {{- end }}
      volumes:
      # one PVC per container, matched by index to the mounts above
      {{- range $i, $container := .Values.containers }}
      - name: data-volume-{{ $i }}
        persistentVolumeClaim:
          claimName: pvc-{{ $i }}
      {{- end }}
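
For reference, the values.yaml this template would consume could look like the sketch below; the keys and images are only illustrative, not a final chart interface:

containers:
  - image: ghcr.io/ggerganov/llama.cpp:server   # generative model server
  - image: ghcr.io/ggerganov/llama.cpp:server   # embedding model server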

On another note, what is the intended use of Prometheus? Do you need it to live alongside the Helm chart or within it as a subchart? I don't see the value in adding Prometheus as a subchart. Perhaps you can share your view on that as well.

phymbert commented 3 weeks ago

Embedding models are different from generative ones. In a RAG setup you need two models.
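
As a rough sketch of what I mean, one values entry per model, with the second deployment started in embedding mode (names and paths below are illustrative, assuming the server's --embedding switch):

models:
  generative:
    image: ghcr.io/ggerganov/llama.cpp:server
    args: ["-m", "/models/generative.gguf"]
  embeddings:
    image: ghcr.io/ggerganov/llama.cpp:server
    args: ["-m", "/models/embeddings.gguf", "--embedding"]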

Prometheus is not required, but if it is present, metrics are exported.
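
The server exposes a Prometheus-compatible /metrics endpoint when started with --metrics, so on the chart side it is mostly a matter of making the pod scrapeable, for example with the usual annotations (a sketch, assuming the server's default port 8080 and a Prometheus scrape config that honours these annotations):

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/path: "/metrics"
  prometheus.io/port: "8080"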

OmegAshEnr01n commented 3 weeks ago

Ok, just to clarify: server.cpp has a route for requesting embeddings, but the existing server code doesn't include the option to send embeddings for completions. That would need to be written before the Helm chart can be completed. Kindly correct me if I'm wrong.

phymbert commented 3 weeks ago

Embeddings are meant to be stored in a vector DB for search. They have nothing to do with completions except for RAG later on, and nothing needs to change in the server code.

ceddybi commented 2 weeks ago

@OmegAshEnr01n Sir, is the chart ready for production? 🚀🚀🚀🚀

OmegAshEnr01n commented 2 weeks ago

Not yet. Currently testing it on a personal kube cluster with separate node selectors.
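
For context, something along these lines in each deployment's values, where the node labels are specific to my cluster rather than chart defaults:

generative:
  nodeSelector:
    llm-role: generative
embeddings:
  nodeSelector:
    llm-role: embeddings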