Motivation
Create this generic vLLM example to give customers a more code-free experience using Truss with vLLM. The goal is to let customers create any customized vLLM deployment by modifying only the config file, with no need to touch model.py.
Changes
In config.yaml
add a new field openai_compatible to specify whether users want to use OpenAI-compatible mode with vLLM
add a new field vllm_config to take all vLLM engine parameter overrides from users (see the sketch below)
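A minimal sketch of what these fields could look like in config.yaml; the nesting under model_metadata and the example values are assumptions, not the final schema:

```yaml
# Hypothetical layout -- only openai_compatible and vllm_config come from
# this PR; the surrounding keys and values are illustrative assumptions.
model_metadata:
  openai_compatible: true   # toggle between the two modes described below
  vllm_config:              # passed through as vLLM engine parameter overrides
    model: meta-llama/Meta-Llama-3.1-8B-Instruct
    tensor_parallel_size: 1
    max_model_len: 8192
```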
In model.py
Combine the two modes of using vLLM in one file (sketched after this list)
when openai_compatible = false (standard mode), load() will initialize an AsyncLLMEngine
when openai_compatible = true (OpenAI-compatible mode), load() will skip initializing AsyncLLMEngine and instead start an OpenAI-compatible inference server locally
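A minimal sketch of the two-mode load() described above; the attribute names, config plumbing, and subprocess launch details are assumptions, not the PR's exact implementation:

```python
import subprocess

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


class Model:
    def __init__(self, **kwargs):
        # Truss passes the parsed config.yaml in kwargs; the exact key
        # layout below (model_metadata) is an assumption.
        config = kwargs["config"]
        self._openai_compatible = config["model_metadata"]["openai_compatible"]
        self._vllm_config = config["model_metadata"].get("vllm_config", {})
        self._engine = None
        self._server_proc = None

    def load(self):
        if self._openai_compatible:
            # OpenAI-compatible mode: launch vLLM's bundled API server as a
            # local subprocess instead of building an engine in-process.
            command = ["python", "-m", "vllm.entrypoints.openai.api_server"]
            for key, value in self._vllm_config.items():
                command.extend([f"--{key.replace('_', '-')}", str(value)])
            self._server_proc = subprocess.Popen(command)
        else:
            # Standard mode: build AsyncLLMEngine directly from the
            # user-supplied engine parameter overrides.
            self._engine = AsyncLLMEngine.from_engine_args(
                AsyncEngineArgs(**self._vllm_config)
            )
```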
Testing
Successfully used this example to deploy the following models:
Llama 3.1 8B Instruct (OpenAI-compatible mode)
Llama 3.1 8B Instruct (standard mode)
Gemma 2 9B Instruct (standard mode)
Mistral 7B v2 AWQ on T4 (model quantization)
Ultravox v0.2 (customized vLLM image)
To follow up
Figure out a better way to handle server startup failure than using MAX_FAILED_SECONDS
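For reference, a minimal sketch of the current readiness check that MAX_FAILED_SECONDS gates; the /health endpoint, polling interval, and constant value are assumptions:

```python
import time

import requests

MAX_FAILED_SECONDS = 600  # assumed value; give up after this many seconds


def wait_for_server(base_url: str = "http://localhost:8000") -> None:
    # Poll the local server's health endpoint until it responds or the
    # time budget is exhausted.
    start = time.time()
    while time.time() - start < MAX_FAILED_SECONDS:
        try:
            if requests.get(f"{base_url}/health", timeout=1).ok:
                return
        except requests.exceptions.RequestException:
            pass  # server not up yet; keep polling
        time.sleep(1)
    raise RuntimeError("vLLM OpenAI-compatible server failed to start in time")
```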