HabanaAI / vllm-fork

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: The prompt bucket shape will not impact the performance #209

Open JunxiChhen opened 3 months ago

JunxiChhen commented 3 months ago

Your current environment

I am testing offline performance using benchmark_latency.py, and I found that throughput does not increase or decrease when I change the prompt bucket shape, even when I use (1, 1).
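For context, the prompt bucket shape on Gaudi is controlled through environment variables of the form VLLM_PROMPT_BS_BUCKET_* / VLLM_PROMPT_SEQ_BUCKET_* (per the vllm-fork bucketing docs). Below is a minimal offline sketch of the kind of run being benchmarked; it is not the exact benchmark_latency.py invocation, and the model name, prompt length, and batch size are placeholders:

```python
# Rough sketch of an offline latency run with an explicit prompt bucket
# config (bs:[1, 64, 128], seq:[128, 128, 1024]). Env var names follow the
# vllm-fork bucketing docs; model, prompts, and batch size are assumptions.
import os
import time

os.environ["VLLM_PROMPT_BS_BUCKET_MIN"] = "1"
os.environ["VLLM_PROMPT_BS_BUCKET_STEP"] = "64"
os.environ["VLLM_PROMPT_BS_BUCKET_MAX"] = "128"
os.environ["VLLM_PROMPT_SEQ_BUCKET_MIN"] = "128"
os.environ["VLLM_PROMPT_SEQ_BUCKET_STEP"] = "128"
os.environ["VLLM_PROMPT_SEQ_BUCKET_MAX"] = "1024"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model
params = SamplingParams(temperature=1.0, ignore_eos=True, max_tokens=128)
prompts = ["hello " * 512] * 128  # assumed: 128 prompts of ~512 tokens each

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"E2E Throughput: {generated / elapsed:.3f} tokens/sec.")
```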

How would you like to use vllm

Prompt bucket config (min, step, max_warmup) bs:[1, 1, 1], seq:[1, 1, 1]:

6: INFO 08-28 11:12:47 habana_model_runner.py:1128] Graph/Prompt captured:1 (100.0%) used_mem:0 B buckets:[(1, 1)]
6: INFO 08-28 11:12:47 habana_model_runner.py:1128] Graph/Decode captured:72 (100.0%) used_mem:3.239 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (8, 128), (8, 256), (8, 384), (8, 512), (8, 640), (8, 768), (8, 896), (8, 1024), (8, 1152), (16, 128), (16, 256), (16, 384), (16, 512), (16, 640), (16, 768), (16, 896), (16, 1024), (16, 1152), (32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (32, 896), (32, 1024), (32, 1152), (64, 128), (64, 256), (64, 384), (64, 512), (64, 640), (64, 768), (64, 896), (64, 1024), (64, 1152), (128, 128), (128, 256), (128, 384), (128, 512), (128, 640), (128, 768), (128, 896), (128, 1024), (128, 1152)]
6: INFO 08-28 11:12:47 habana_model_runner.py:1206] Warmup finished in 45 secs, allocated 3.451 GiB of device memory
6: INFO 08-28 11:12:47 habana_executor.py:91] init_cache_engine took 46.45 GiB of device memory (61.43 GiB/94.62 GiB used) and 2.484 GiB of host memory (61.33 GiB/1007 GiB used)
6: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=128, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
6: Warming up...
Warmup iterations: 100%|██████████| 5/5 [01:08<00:00, 13.71s/it]
Profiling iterations: 100%|██████████| 10/10 [02:16<00:00, 13.64s/it]
6: E2E Throughput: 1200.877 tokens/sec.

Prompt bucket config (min, step, max_warmup) bs:[1, 64, 128], seq:[128, 128, 1024]:

6: INFO 08-28 11:25:56 habana_model_runner.py:1128] Graph/Prompt captured:18 (28.1%) used_mem:19.79 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (4, 128), (4, 256), (4, 384), (8, 128)]
6: INFO 08-28 11:25:56 habana_model_runner.py:1128] Graph/Decode captured:72 (100.0%) used_mem:3.239 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (8, 128), (8, 256), (8, 384), (8, 512), (8, 640), (8, 768), (8, 896), (8, 1024), (8, 1152), (16, 128), (16, 256), (16, 384), (16, 512), (16, 640), (16, 768), (16, 896), (16, 1024), (16, 1152), (32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (32, 896), (32, 1024), (32, 1152), (64, 128), (64, 256), (64, 384), (64, 512), (64, 640), (64, 768), (64, 896), (64, 1024), (64, 1152), (128, 128), (128, 256), (128, 384), (128, 512), (128, 640), (128, 768), (128, 896), (128, 1024), (128, 1152)]
6: INFO 08-28 11:25:56 habana_model_runner.py:1206] Warmup finished in 146 secs, allocated 23.03 GiB of device memory
6: INFO 08-28 11:25:56 habana_executor.py:91] init_cache_engine took 61.18 GiB of device memory (85.15 GiB/94.62 GiB used) and 2.813 GiB of host memory (61.55 GiB/1007 GiB used)
6: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=128, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
6: Warming up...
Warmup iterations: 100%|██████████| 5/5 [01:07<00:00, 13.57s/it]
Profiling iterations: 100%|██████████| 10/10 [02:16<00:00, 13.61s/it]
6: E2E Throughput: 1203.730 tokens/sec.
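For scale, the two configurations land within about 0.24% of each other (1203.730 vs. 1200.877 tokens/sec). Assuming a batch of 128 prompts each generating max_tokens=128 (an assumption; the batch size is not shown in the logs), one profiling iteration produces roughly 128 * 128 = 16,384 tokens, and 16,384 / 1200.877 ≈ 13.6 s, which matches the ~13.6 s per iteration shown in both progress bars.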

github-actions[bot] commented 6 days ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!