Motivation
Create this generic vLLM example to give customers a more code-free experience using Truss with vLLM. The goal is to let customers create any customized vLLM deployment by modifying only the config file, with no need to touch model.py.
Changes
In config.yaml
add a new field openai_compatible to specify whether users want to use OpenAI-compatible mode with vLLM
add a new field vllm_config to take all vLLM engine parameter overrides from users (see the sketch below)
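A minimal sketch of what these fields could look like in config.yaml; the nesting under model_metadata and the example values are assumptions, not the final schema:

```yaml
# Hypothetical layout -- only openai_compatible and vllm_config come from
# this PR; the surrounding keys and values are illustrative assumptions.
model_metadata:
  openai_compatible: true   # toggle between the two modes described below
  vllm_config:              # passed through as vLLM engine parameter overrides
    model: meta-llama/Meta-Llama-3.1-8B-Instruct
    tensor_parallel_size: 1
    max_model_len: 8192
```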
In model.py
Combine the two modes of using vLLM in one file (sketched after this list)
when openai_compatible = false (standard mode), load() will initialize an AsyncLLMEngine
when openai_compatible = true (OpenAI-compatible mode), load() will skip initializing AsyncLLMEngine and instead start an OpenAI-compatible inference server locally
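A minimal sketch of the two-mode load() described above; the attribute names, config plumbing, and subprocess launch details are assumptions, not the PR's exact implementation:

```python
import subprocess

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


class Model:
    def __init__(self, **kwargs):
        # Truss passes the parsed config.yaml in kwargs; the exact key
        # layout below (model_metadata) is an assumption.
        config = kwargs["config"]
        self._openai_compatible = config["model_metadata"]["openai_compatible"]
        self._vllm_config = config["model_metadata"].get("vllm_config", {})
        self._engine = None
        self._server_proc = None

    def load(self):
        if self._openai_compatible:
            # OpenAI-compatible mode: launch vLLM's bundled API server as a
            # local subprocess instead of building an engine in-process.
            command = ["python", "-m", "vllm.entrypoints.openai.api_server"]
            for key, value in self._vllm_config.items():
                command.extend([f"--{key.replace('_', '-')}", str(value)])
            self._server_proc = subprocess.Popen(command)
        else:
            # Standard mode: build AsyncLLMEngine directly from the
            # user-supplied engine parameter overrides.
            self._engine = AsyncLLMEngine.from_engine_args(
                AsyncEngineArgs(**self._vllm_config)
            )
```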
Testing
Successfully used this example to deploy the following models:
Llama 3.1 8B Instruct (OpenAI-compatible mode)
Llama 3.1 8B Instruct (standard mode)
Gemma 2 9B Instruct (standard mode)
Mistral 7B v2 AWQ on T4 (model quantization)
Ultravox v0.2 (customized vLLM image)
To follow up
Figure out a better way to handle server startup failure than using MAX_FAILED_SECONDS
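For reference, a minimal sketch of the current readiness check that MAX_FAILED_SECONDS gates; the /health endpoint, polling interval, and constant value are assumptions:

```python
import time

import requests

MAX_FAILED_SECONDS = 600  # assumed value; give up after this many seconds


def wait_for_server(base_url: str = "http://localhost:8000") -> None:
    # Poll the local server's health endpoint until it responds or the
    # time budget is exhausted.
    start = time.time()
    while time.time() - start < MAX_FAILED_SECONDS:
        try:
            if requests.get(f"{base_url}/health", timeout=1).ok:
                return
        except requests.exceptions.RequestException:
            pass  # server not up yet; keep polling
        time.sleep(1)
    raise RuntimeError("vLLM OpenAI-compatible server failed to start in time")
```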