Closed: zmvictor closed this 1 month ago
Hi @zmvictor, I just remembered that this breaks the benchmark automation we have for vLLM, where we are still using the `/generate` API rather than the `/completions` API: https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/benchmarks/benchmark/tools/locust-load-inference/locust-docker/locust-tasks/tasks.py#L172. It would be good to address that too.
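To illustrate what a migration in that Locust task would involve, here is a minimal sketch of the two request shapes. The endpoint paths and field names follow the vLLM docs (legacy `api_server` `/generate` vs. the OpenAI-compatible `/v1/completions`); the helper names, base URL, and default model are assumptions for illustration, not code from this repo.

```python
# Sketch: request payloads for vLLM's legacy /generate endpoint versus the
# OpenAI-compatible /v1/completions endpoint. Helper names are hypothetical.

def generate_payload(prompt: str, max_tokens: int = 128) -> dict:
    # Legacy api_server: sampling params at the top level, no "model" field.
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": 0.0}

def completions_payload(prompt: str, max_tokens: int = 128,
                        model: str = "meta-llama/Llama-2-7b-chat-hf") -> dict:
    # OpenAI-compatible server: a "model" field is required in the body.
    return {"model": model, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": 0.0}

# A client moving off /generate would change the path and add "model":
#   POST http://<host>:8000/generate  ->  POST http://<host>:8000/v1/completions
```

The key difference for the benchmark code is the extra required `model` field and the `/v1/completions` path; the response JSON shape also changes to the OpenAI schema.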
Per https://docs.vllm.ai/en/latest/serving/metrics.html, the OpenAI API server exposes vLLM serving metrics by default. This PR therefore:
- adds the `swap_space` argument suggested in the vLLM benchmarks

E2e tested with model `meta-llama/Llama-2-7b-chat-hf`. After `terraform apply`:
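Once the deployment is up, the serving metrics can be scraped from the server's `/metrics` endpoint in Prometheus text format. The snippet below is a minimal sketch of parsing that format; the sample lines use a real vLLM metric name (`vllm:num_requests_running`) but are illustrative, not captured from this deployment.

```python
# Sketch: parse Prometheus text-format output such as what vLLM's /metrics
# endpoint returns. Sample input is illustrative, not a real scrape.

def parse_prometheus(text: str) -> dict:
    """Map each metric line (name plus labels) to its float value."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comment lines
        name_with_labels, _, value = line.rpartition(" ")
        metrics[name_with_labels] = float(value)
    return metrics

sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-2-7b-chat-hf"} 2.0
"""
print(parse_prometheus(sample))
```

In practice the text would come from an HTTP GET against `http://<host>:8000/metrics` on the OpenAI API server.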