Adding Multi-Node Triton + TRT-LLM Deployment on EKS example
This example shows:
LeaderWorkerSet for launching Triton+TRT-LLM on groups of pods: To launch Triton and TRT-LLM across nodes, one node uses MPI to start TRT-LLM processes on every node (including itself) that makes up one instance of the model. Doing this requires knowing the hostnames of all involved nodes, so we need to spawn pods in groups and know which model instance group each pod belongs to. To achieve this we use LeaderWorkerSet, which lets us create "megapods" that consist of a group of pods (one leader pod and a specified number of worker pods) and provides pod labels identifying group membership. We configure the LeaderWorkerSet and launch Triton+TRT-LLM via MPI in deployment.yaml and server.py.
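For orientation, here is a minimal sketch of what such a LeaderWorkerSet could look like. The names, image tag, group size, and resource limits are illustrative only, not the actual contents of deployment.yaml:

```yaml
# Hypothetical sketch: one Triton+TRT-LLM model instance per pod group.
# All field values (names, image tag, sizes) are placeholders; see
# deployment.yaml in the example for the real manifest.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: triton-trtllm
spec:
  replicas: 1                 # number of model instances ("megapods")
  leaderWorkerTemplate:
    size: 2                   # pods per instance: 1 leader + 1 worker
    leaderTemplate:           # the leader runs server.py, which launches
      spec:                   # TRT-LLM on every pod in the group via MPI
        containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3
          command: ["python3", "server.py"]
          resources:
            limits:
              nvidia.com/gpu: 8
    workerTemplate:           # workers only need to accept the MPI launch
      spec:
        containers:
        - name: worker
          image: nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3
          resources:
            limits:
              nvidia.com/gpu: 8
```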
Gang Scheduling: Gang scheduling simply means ensuring all pods that make up a model instance are ready before Triton+TRT-LLM is launched. We show how to use kubessh to achieve this in the wait_for_workers function of server.py.
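A minimal sketch of that wait loop is below, assuming a kubessh command is available on the PATH and the worker hostnames are already known; the exact probe and retry logic in server.py may differ:

```python
# Hypothetical sketch of gang scheduling: block until every pod in the group
# answers a trivial probe before launching Triton+TRT-LLM via MPI.
import subprocess
import time

def wait_for_workers(worker_hostnames, timeout_s=600, poll_s=5):
    """Return once every worker responds to a probe command, or raise."""
    deadline = time.time() + timeout_s
    pending = set(worker_hostnames)
    while pending and time.time() < deadline:
        for host in list(pending):
            try:
                # Run `hostname` on the pod via kubessh; success means the pod
                # is scheduled, running, and reachable for the MPI launch.
                result = subprocess.run(
                    ["kubessh", host, "hostname"],
                    capture_output=True,
                    timeout=30,
                )
                if result.returncode == 0:
                    pending.discard(host)
            except subprocess.TimeoutExpired:
                pass  # not ready yet; retry on the next pass
        if pending:
            time.sleep(poll_s)
    if pending:
        raise RuntimeError(f"Workers never became ready: {sorted(pending)}")
```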
Autoscaling: By default the Horizontal Pod Autoscaler (HPA) scales individual pods, but LeaderWorkerSet makes it possible to scale each "megapod". However, since these are GPU workloads, we don't want to autoscale on CPU and host memory usage. We show how to leverage the metrics Triton Server exposes through Prometheus and set up GPU utilization recording rules in triton-metrics_prometheus-rule.yaml. We also demonstrate how to properly set up PodMonitors and an HPA in pod-monitor.yaml and hpa.yaml (the key is to only scrape metrics from the leader pods). Instructions for properly setting up Prometheus and exposing GPU metrics are found in Configure EKS Cluster and Install Dependencies. To enable the deployment to dynamically add more nodes in response to the HPA, we also set up the Cluster Autoscaler.
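As a rough sketch of how those pieces fit together, something like the following could scrape only the leaders and scale whole groups. The label keys, metric name, and threshold are placeholders, not the actual contents of pod-monitor.yaml and hpa.yaml:

```yaml
# Hypothetical sketch: scrape Triton metrics from leader pods only, then
# scale the LeaderWorkerSet on a GPU-utilization recording rule instead of
# CPU/memory. All labels, names, and numbers are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: triton-leaders
spec:
  selector:
    matchLabels:
      # Select leaders only; worker pods do not serve Triton metrics.
      leaderworkerset.sigs.k8s.io/worker-index: "0"
  podMetricsEndpoints:
  - port: metrics            # Triton's Prometheus endpoint (default :8002)
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-trtllm
spec:
  scaleTargetRef:
    # Scale whole groups, not individual pods, by targeting the LeaderWorkerSet.
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: triton-trtllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: triton:gpu_utilization:avg   # recording rule, not CPU/memory
      target:
        type: AverageValue
        averageValue: "80"                 # e.g. 80% average GPU utilization
```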
LoadBalancer Setup: Although there are multiple pods in each instance of the model, only one pod within each group accepts requests. We show how to correctly set up a LoadBalancer Service in service.yaml so that external clients can submit requests.
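A minimal sketch of such a Service follows, assuming the leader pods can be selected by LeaderWorkerSet labels; the label keys and ports are illustrative rather than the actual service.yaml:

```yaml
# Hypothetical sketch: route external traffic only to leader pods, since
# only they accept inference requests. Labels and ports are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: triton-trtllm
spec:
  type: LoadBalancer
  selector:
    leaderworkerset.sigs.k8s.io/name: triton-trtllm
    leaderworkerset.sigs.k8s.io/worker-index: "0"   # leaders only
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  - name: grpc
    port: 8001
    targetPort: 8001
```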
@amanshanbhag As discussed, I have made the changes, updated the links and the READMEs, and made sure the end-to-end flow makes sense.
Setup and Installation