[Open] @ctandrewtran opened this issue 1 year ago
@ctandrewtran

> What is the recommended container for hosting Flan T5 XXL?

g5.12xlarge is recommended.

> Are there recommended configs for the TGI Container that I might be missing?

Could you please share your config?
I am using g5.12xlarge and it's deployed within a VPC:

```python
config = {
    'HF_MODEL_ID': 'google/flan-t5-xxl',
    'SM_NUM_GPUS': '4',
    'HF_MODEL_QUANTIZE': 'bitsandbytes'
}
```
With this, I am getting 5-6 seconds of latency for a prompt + question, whereas with the DJL-FasterTransformer container it is sub-second.
Is this the expected latency?
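For reference, here is a minimal sketch of how an env config like this is passed to the TGI container through the SageMaker Python SDK (the image-URI lookup and deploy parameters below are assumptions, not taken from this thread):

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Environment variables consumed by the TGI container at startup
config = {
    'HF_MODEL_ID': 'google/flan-t5-xxl',
    'SM_NUM_GPUS': '4',                   # shard across the 4 A10G GPUs on g5.12xlarge
    'HF_MODEL_QUANTIZE': 'bitsandbytes',  # 8-bit weights: saves memory, can add per-token latency
}

model = HuggingFaceModel(
    role=role,
    image_uri=get_huggingface_llm_image_uri("huggingface"),  # TGI container
    env=config,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
)
```

Worth noting that bitsandbytes 8-bit inference is generally slower per token than fp16, so the quantization setting itself may account for part of the gap versus FasterTransformer.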
Hi @ctandrewtran, do you also do bitsandbytes quantization for FasterTransformer? Wondering if the latency differences are due to quantization!
@ctandrewtran @nth-attempt
If you're looking to maximize LLM throughput, LiteLLM now has a router to load-balance requests (I'd love feedback if you're trying to do this).
Here's the quick-start doc: https://docs.litellm.ai/docs/simple_proxy#model-alias
```yaml
model_list:
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8001
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8002
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8003
```

```shell
litellm --config /path/to/config.yaml
```
```shell
curl --location 'http://0.0.0.0:8000/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "zephyr-beta",
    "messages": [
      {
        "role": "user",
        "content": "what llm are you"
      }
    ]
  }'
```
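Since the LiteLLM proxy is OpenAI-compatible, the same request can also be made from Python (a sketch assuming the proxy is running on port 8000 as above):

```python
import openai

# Point the OpenAI client at the LiteLLM proxy
client = openai.OpenAI(
    base_url="http://0.0.0.0:8000",
    api_key="anything",  # ignored unless the proxy has auth configured
)

# "zephyr-beta" is the alias; the proxy load-balances across the three backends
response = client.chat.completions.create(
    model="zephyr-beta",
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(response.choices[0].message.content)
```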
Hello-
I've been looking into hosting an LLM on AWS infrastructure. I am mainly looking to host Flan T5 XXL. My question is below.

Inquiry: What is the recommended container for hosting Flan T5 XXL?

Context: I've hosted Flan T5 XXL using both the TGI container and the DJL-FasterTransformer container. Using the same prompt, TGI takes around 5-6 seconds whereas the DJL-FasterTransformer container takes 0.5-1.5 seconds. The DJL-FasterTransformer container has tensor-parallel-degree set to 4, and SM_NUM_GPUS for TGI was set to 4. Both were hosted on ml.g5.12xlarge.
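For comparison, a DJL-FasterTransformer deployment with tensor-parallel-degree 4 typically boils down to a serving.properties along these lines (a sketch; the exact file used in this deployment isn't shown in the thread):

```properties
# serving.properties for DJL Serving (LMI) with the FasterTransformer engine
engine=FasterTransformer
option.model_id=google/flan-t5-xxl
option.tensor_parallel_degree=4
```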