GoogleCloudPlatform / ai-on-gke

AI on GKE is a collection of examples, best-practices, and prebuilt solutions to help build, deploy, and scale AI Platforms on Google Kubernetes Engine
Apache License 2.0

Support vllm openai api server #694

Closed · zmvictor closed this 1 month ago

zmvictor commented 1 month ago

Per https://docs.vllm.ai/en/latest/serving/metrics.html, the OpenAI API server exposes vLLM serving metrics by default. This PR therefore switches the vLLM deployment to the OpenAI-compatible API server so the metrics endpoint comes for free.

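For context, the change boils down to which vLLM entrypoint the serving container runs. A minimal sketch (the exact flags in this PR may differ):

# Legacy server: exposes /generate; the linked docs describe metrics on the OpenAI server
$ python -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-chat-hf

# OpenAI-compatible server: exposes /v1/completions and Prometheus /metrics by default
$ python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf
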
End-to-end tested with the model meta-llama/Llama-2-7b-chat-hf. After terraform apply:

# Get vLLM LB's external IP
$ VLLM_EXTERNAL_IP=$(kubectl -n benchmark get service vllm -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# send a prompt to the endpoint
$ curl $VLLM_EXTERNAL_IP/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Seattle City is a",
        "max_tokens": 7,
        "temperature": 0
    }'
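
# For reference, the endpoint returns an OpenAI-style completion object.
# A rough sketch of the response shape (values are illustrative placeholders,
# except the token counts, which line up with the metrics below):
# {
#   "id": "cmpl-...",
#   "object": "text_completion",
#   "model": "meta-llama/Llama-2-7b-chat-hf",
#   "choices": [{"index": 0, "text": "...", "finish_reason": "length"}],
#   "usage": {"prompt_tokens": 9, "completion_tokens": 7, "total_tokens": 16}
# }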

# Check prometheus metrics
$ curl $VLLM_EXTERNAL_IP/metrics/

...
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="meta-llama/Llama-2-7b-chat-hf"} 9.0
...
achandrasekar commented 1 month ago

Hi @zmvictor, I just remembered that this breaks the benchmark automation we have for vLLM, which still uses the /generate API rather than the /completions API: https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/benchmarks/benchmark/tools/locust-load-inference/locust-docker/locust-tasks/tasks.py#L172. It would be good to address that too.
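
For context, the incompatibility is that the locust task posts to the legacy /generate route, which the OpenAI-compatible server does not serve (the request would 404). A rough sketch of the now-failing request (payload fields are assumptions based on vLLM's legacy API, not copied from tasks.py):

# Legacy endpoint the locust task targets; the OpenAI API server has no /generate route
$ curl $VLLM_EXTERNAL_IP/generate \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "Seattle City is a",
        "max_tokens": 7,
        "temperature": 0
    }'

The fix would be to point the task at /v1/completions instead (as in the example above) and add the required "model" field to the payload.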