HabanaAI / vllm-fork

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: The prompt bucket shape will not impact the performance #209

Open JunxiChhen opened 3 months ago

JunxiChhen commented 3 months ago

Your current environment

I am testing offline performance using benchmark_latency.py, and I found that throughput does not increase or decrease when I change the prompt bucket shape, even when I use (1, 1).
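For context, the prompt bucket shape on Gaudi is controlled through environment variables of the form VLLM_PROMPT_BS_BUCKET_* / VLLM_PROMPT_SEQ_BUCKET_* (per the vllm-fork bucketing docs). Below is a minimal offline sketch of the kind of run being benchmarked; it is not the exact benchmark_latency.py invocation, and the model name, prompt length, and batch size are placeholders:

```python
# Rough sketch of an offline latency run with an explicit prompt bucket
# config (bs:[1, 64, 128], seq:[128, 128, 1024]). Env var names follow the
# vllm-fork bucketing docs; model, prompts, and batch size are assumptions.
import os
import time

os.environ["VLLM_PROMPT_BS_BUCKET_MIN"] = "1"
os.environ["VLLM_PROMPT_BS_BUCKET_STEP"] = "64"
os.environ["VLLM_PROMPT_BS_BUCKET_MAX"] = "128"
os.environ["VLLM_PROMPT_SEQ_BUCKET_MIN"] = "128"
os.environ["VLLM_PROMPT_SEQ_BUCKET_STEP"] = "128"
os.environ["VLLM_PROMPT_SEQ_BUCKET_MAX"] = "1024"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model
params = SamplingParams(temperature=1.0, ignore_eos=True, max_tokens=128)
prompts = ["hello " * 512] * 128  # assumed: 128 prompts of ~512 tokens each

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"E2E Throughput: {generated / elapsed:.3f} tokens/sec.")
```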

How would you like to use vllm

Prompt bucket config (min, step, max_warmup) bs:[1, 1, 1], seq:[1, 1, 1]:

6: INFO 08-28 11:12:47 habana_model_runner.py:1128] Graph/Prompt captured:1 (100.0%) used_mem:0 B buckets:[(1, 1)]
6: INFO 08-28 11:12:47 habana_model_runner.py:1128] Graph/Decode captured:72 (100.0%) used_mem:3.239 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (8, 128), (8, 256), (8, 384), (8, 512), (8, 640), (8, 768), (8, 896), (8, 1024), (8, 1152), (16, 128), (16, 256), (16, 384), (16, 512), (16, 640), (16, 768), (16, 896), (16, 1024), (16, 1152), (32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (32, 896), (32, 1024), (32, 1152), (64, 128), (64, 256), (64, 384), (64, 512), (64, 640), (64, 768), (64, 896), (64, 1024), (64, 1152), (128, 128), (128, 256), (128, 384), (128, 512), (128, 640), (128, 768), (128, 896), (128, 1024), (128, 1152)]
6: INFO 08-28 11:12:47 habana_model_runner.py:1206] Warmup finished in 45 secs, allocated 3.451 GiB of device memory
6: INFO 08-28 11:12:47 habana_executor.py:91] init_cache_engine took 46.45 GiB of device memory (61.43 GiB/94.62 GiB used) and 2.484 GiB of host memory (61.33 GiB/1007 GiB used)
6: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=128, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
6: Warming up...
Warmup iterations: 100%|██████████| 5/5 [01:08<00:00, 13.71s/it]
Profiling iterations: 100%|██████████| 10/10 [02:16<00:00, 13.64s/it]
6: E2E Throughput: 1200.877 tokens/sec.

Prompt bucket config (min, step, max_warmup) bs:[1, 64, 128], seq:[128, 128, 1024]:

6: INFO 08-28 11:25:56 habana_model_runner.py:1128] Graph/Prompt captured:18 (28.1%) used_mem:19.79 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (4, 128), (4, 256), (4, 384), (8, 128)]
6: INFO 08-28 11:25:56 habana_model_runner.py:1128] Graph/Decode captured:72 (100.0%) used_mem:3.239 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (8, 128), (8, 256), (8, 384), (8, 512), (8, 640), (8, 768), (8, 896), (8, 1024), (8, 1152), (16, 128), (16, 256), (16, 384), (16, 512), (16, 640), (16, 768), (16, 896), (16, 1024), (16, 1152), (32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (32, 896), (32, 1024), (32, 1152), (64, 128), (64, 256), (64, 384), (64, 512), (64, 640), (64, 768), (64, 896), (64, 1024), (64, 1152), (128, 128), (128, 256), (128, 384), (128, 512), (128, 640), (128, 768), (128, 896), (128, 1024), (128, 1152)]
6: INFO 08-28 11:25:56 habana_model_runner.py:1206] Warmup finished in 146 secs, allocated 23.03 GiB of device memory
6: INFO 08-28 11:25:56 habana_executor.py:91] init_cache_engine took 61.18 GiB of device memory (85.15 GiB/94.62 GiB used) and 2.813 GiB of host memory (61.55 GiB/1007 GiB used)
6: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=128, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
6: Warming up...
Warmup iterations: 100%|██████████| 5/5 [01:07<00:00, 13.57s/it]
Profiling iterations: 100%|██████████| 10/10 [02:16<00:00, 13.61s/it]
6: E2E Throughput: 1203.730 tokens/sec.
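For scale, the two configurations land within about 0.24% of each other (1203.730 vs. 1200.877 tokens/sec). Assuming a batch of 128 prompts each generating max_tokens=128 (an assumption; the batch size is not shown in the logs), one profiling iteration produces roughly 128 * 128 = 16,384 tokens, and 16,384 / 1200.877 ≈ 13.6 s, which matches the ~13.6 s per iteration shown in both progress bars.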

github-actions[bot] commented 6 days ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!