[Open] @ctandrewtran opened this issue 1 year ago
@ctandrewtran

> What is the recommended container for hosting Flan T5 XXL?

g5.12xlarge is recommended.

> Are there recommended configs for the TGI Container that I might be missing?

Could you please share your config?
I am using g5.12xlarge and it's deployed within a VPC:

```python
config = {
    'HF_MODEL_ID': 'google/flan-t5-xxl',
    'SM_NUM_GPUS': '4',
    'HF_MODEL_QUANTIZE': 'bitsandbytes'
}
```
With this, I am getting 5-6 seconds of latency for a prompt + question, whereas with the DJL-FasterTransformer container it is sub-second.
Is this the expected latency?
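For reference, here is a minimal sketch of how an env config like this is passed to the TGI container through the SageMaker Python SDK (the image-URI lookup and deploy parameters below are assumptions, not taken from this thread):

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Environment variables consumed by the TGI container at startup
config = {
    'HF_MODEL_ID': 'google/flan-t5-xxl',
    'SM_NUM_GPUS': '4',                   # shard across the 4 A10G GPUs on g5.12xlarge
    'HF_MODEL_QUANTIZE': 'bitsandbytes',  # 8-bit weights: saves memory, can add per-token latency
}

model = HuggingFaceModel(
    role=role,
    image_uri=get_huggingface_llm_image_uri("huggingface"),  # TGI container
    env=config,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
)
```

Worth noting that bitsandbytes 8-bit inference is generally slower per token than fp16, so the quantization setting itself may account for part of the gap versus FasterTransformer.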
Hi @ctandrewtran, do you also do bitsandbytes quantization for FasterTransformer? Wondering if the latency differences are due to quantization!
@ctandrewtran @nth-attempt
If you're looking to maximize LLM throughput, LiteLLM now has a router to load-balance requests (I'd love feedback if you're trying to do this).
Here's the quick-start doc: https://docs.litellm.ai/docs/simple_proxy#model-alias
```yaml
model_list:
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8001
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8002
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8003
```

```shell
litellm --config /path/to/config.yaml
```
```shell
curl --location 'http://0.0.0.0:8000/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "zephyr-beta",
    "messages": [
      {
        "role": "user",
        "content": "what llm are you"
      }
    ]
  }'
```
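Since the LiteLLM proxy is OpenAI-compatible, the same request can also be made from Python (a sketch assuming the proxy is running on port 8000 as above):

```python
import openai

# Point the OpenAI client at the LiteLLM proxy
client = openai.OpenAI(
    base_url="http://0.0.0.0:8000",
    api_key="anything",  # ignored unless the proxy has auth configured
)

# "zephyr-beta" is the alias; the proxy load-balances across the three backends
response = client.chat.completions.create(
    model="zephyr-beta",
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(response.choices[0].message.content)
```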
Hello-
I've been looking into hosting an LLM on AWS infrastructure. I am mainly looking to host Flan T5 XXL. My question is below.

Inquiry: What is the recommended container for hosting Flan T5 XXL?

Context: I've hosted Flan T5 XXL using both the TGI container and the DJL-FasterTransformer container. Using the same prompt, TGI takes around 5-6 seconds whereas the DJL-FasterTransformer container takes 0.5-1.5 seconds. The DJL-FasterTransformer container has tensor-parallel-degree set to 4, and SM_NUM_GPUS for TGI was set to 4. Both were hosted on ml.g5.12xlarge.
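For comparison, a DJL-FasterTransformer deployment with tensor-parallel-degree 4 typically boils down to a serving.properties along these lines (a sketch; the exact file used in this deployment isn't shown in the thread):

```properties
# serving.properties for DJL Serving (LMI) with the FasterTransformer engine
engine=FasterTransformer
option.model_id=google/flan-t5-xxl
option.tensor_parallel_degree=4
```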