aws-samples / awsome-inference


add lws based multi-node triton trtllm example #20

Closed kshitizgupta21 closed 1 month ago

kshitizgupta21 commented 2 months ago

Adding Multi-Node Triton + TRT-LLM Deployment on EKS example

This example shows:

  1. LeaderWorkerSet for launching Triton+TRT-LLM on groups of pods: Launching Triton and TRT-LLM across nodes uses MPI: one node spawns the TRT-LLM processes on every node (including itself) that makes up one instance of the model. Doing this requires knowing the hostnames of all involved nodes, so we need to spawn pods in groups and know which model-instance group each pod belongs to. To achieve this we use LeaderWorkerSet, which lets us create "megapods" consisting of one leader pod and a specified number of worker pods, and which provides pod labels identifying group membership. We configure the LeaderWorkerSet and launch Triton+TRT-LLM via MPI in deployment.yaml and server.py (see the LeaderWorkerSet sketch after this list).
  2. Gang Scheduling: Gang scheduling simply means ensuring all pods that make up a model instance are ready before Triton+TRT-LLM is launched. We show how to use kubessh to achieve this in the wait_for_workers function of server.py.
  3. Autoscaling: By default the Horizontal Pod Autoscaler (HPA) scales individual pods, but LeaderWorkerSet makes it possible to scale each "megapod". However, since these are GPU workloads, we don't want to autoscale on CPU and host memory usage. We show how to leverage the metrics Triton Server exposes through Prometheus and set up GPU utilization recording rules in triton-metrics_prometheus-rule.yaml. We also demonstrate how to properly set up PodMonitors and an HPA in pod-monitor.yaml and hpa.yaml (the key is to scrape metrics only from the leader pods; see the sketch after this list). Instructions for properly setting up Prometheus and exposing GPU metrics are found in Configure EKS Cluster and Install Dependencies. To enable the deployment to dynamically add more nodes in response to the HPA, we also set up the Cluster Autoscaler.
  4. LoadBalancer Setup: Although each instance of the model spans multiple pods, only one pod within each group accepts requests. We show how to correctly set up a LoadBalancer Service in service.yaml so that external clients can submit requests (see the Service sketch after this list).
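
For readers unfamiliar with LeaderWorkerSet, the manifest shape looks roughly like the sketch below. This is a minimal illustration and not the deployment.yaml from this PR: the name, image tag, command, GPU counts, and the `role: leader` label are placeholders, and the real example launches Triton+TRT-LLM via MPI from server.py.

```yaml
# Minimal LeaderWorkerSet sketch (placeholders, not this PR's deployment.yaml).
# Each replica is a "megapod": 1 leader pod + (size - 1) worker pods.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: triton-trtllm
spec:
  replicas: 1                # number of model instances (megapods)
  leaderWorkerTemplate:
    size: 2                  # pods per instance: 1 leader + 1 worker
    leaderTemplate:
      metadata:
        labels:
          role: leader       # assumed label; reused below to select leader pods
      spec:
        containers:
          - name: triton
            image: nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3
            command: ["python3", "server.py"]  # launches Triton+TRT-LLM across the group via MPI
            resources:
              limits:
                nvidia.com/gpu: 8
    workerTemplate:
      spec:
        containers:
          - name: worker
            image: nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3
            resources:
              limits:
                nvidia.com/gpu: 8
```

LWS also injects its own group-membership labels on every pod (e.g. leaderworkerset.sigs.k8s.io/group-index), which is what makes it possible to discover all hostnames belonging to one model instance for the MPI launch.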
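The autoscaling wiring in item 3 can be hard to picture, so here is a hedged sketch of the two pieces: a PodMonitor that scrapes only leader pods, and an HPA that scales the LeaderWorkerSet on a GPU-utilization metric. Names are illustrative; the assumed `role: leader` label comes from the sketch above, `triton_gpu_utilization` stands in for whatever the recording rule in triton-metrics_prometheus-rule.yaml actually records, and exposing that metric to the HPA additionally requires a metrics adapter such as prometheus-adapter.

```yaml
# Sketch of pod-monitor.yaml: scrape Triton metrics from leader pods only.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: triton-leader-pods
spec:
  selector:
    matchLabels:
      role: leader           # assumed leader-only label from the LWS sketch
  podMetricsEndpoints:
    - port: metrics          # Triton's Prometheus endpoint (default port 8002)
---
# Sketch of hpa.yaml: scale the LeaderWorkerSet (whole megapods, not single pods).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-trtllm
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: triton-trtllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: triton_gpu_utilization   # hypothetical recording-rule name
        target:
          type: AverageValue
          averageValue: "800m"           # i.e. scale out above ~80% GPU utilization
```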
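Finally, a sketch of the service.yaml idea: the Service selects only the leader pods (again via the assumed `role: leader` label), since those are the only pods in each group that accept requests. The ports shown are Triton's defaults (8000 for HTTP, 8001 for gRPC).

```yaml
# Sketch of service.yaml: route external traffic to leader pods only.
apiVersion: v1
kind: Service
metadata:
  name: triton-trtllm
spec:
  type: LoadBalancer
  selector:
    role: leader             # assumed label; only leaders serve requests
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
```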

Setup and Installation

  1. Create EKS Cluster
  2. Configure EKS Cluster
  3. Deploy Triton
amanshanbhag commented 2 months ago

These files need to go into 1.infrastructure:

  1. Create_EKS_Cluster.md and eks_cluster_config.yaml
  2. pvc/
  3. nccl_test.yaml
kshitizgupta21 commented 1 month ago

@amanshanbhag As discussed, I have made the changes, updated the links and the READMEs, and made sure the end-to-end flow makes sense.