### Your current environment

I am testing offline performance using `benchmark_latency.py`, and I found that end-to-end throughput does not change at all when I vary the prompt bucket shape, even when I use (1, 1).

### How would you like to use vllm
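The bucket shapes below are presumably set through the Gaudi fork's `VLLM_PROMPT_*` environment variables (the `(min, step, max_warmup)` knobs that `habana_model_runner.py` logs). A minimal sketch of the first config, with a placeholder model name; the variables must be set before vLLM is imported:

```python
import os

# First config below: bs:[1, 1, 1], seq:[1, 1, 1].
# Assumed knobs from the Gaudi fork; they are read at engine start-up,
# so they have to be exported before vllm is imported.
os.environ["VLLM_PROMPT_BS_BUCKET_MIN"] = "1"
os.environ["VLLM_PROMPT_BS_BUCKET_STEP"] = "1"
os.environ["VLLM_PROMPT_BS_BUCKET_MAX"] = "1"
os.environ["VLLM_PROMPT_SEQ_BUCKET_MIN"] = "1"
os.environ["VLLM_PROMPT_SEQ_BUCKET_STEP"] = "1"
os.environ["VLLM_PROMPT_SEQ_BUCKET_MAX"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model
params = SamplingParams(temperature=1.0, ignore_eos=True, max_tokens=128)
outputs = llm.generate(["Hello"] * 8, params)
```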
Prompt bucket config (min, step, max_warmup) bs:[1, 1, 1], seq:[1, 1, 1]:
```
6: INFO 08-28 11:12:47 habana_model_runner.py:1128] Graph/Prompt captured:1 (100.0%) used_mem:0 B buckets:[(1, 1)]
6: INFO 08-28 11:12:47 habana_model_runner.py:1128] Graph/Decode captured:72 (100.0%) used_mem:3.239 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (8, 128), (8, 256), (8, 384), (8, 512), (8, 640), (8, 768), (8, 896), (8, 1024), (8, 1152), (16, 128), (16, 256), (16, 384), (16, 512), (16, 640), (16, 768), (16, 896), (16, 1024), (16, 1152), (32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (32, 896), (32, 1024), (32, 1152), (64, 128), (64, 256), (64, 384), (64, 512), (64, 640), (64, 768), (64, 896), (64, 1024), (64, 1152), (128, 128), (128, 256), (128, 384), (128, 512), (128, 640), (128, 768), (128, 896), (128, 1024), (128, 1152)]
6: INFO 08-28 11:12:47 habana_model_runner.py:1206] Warmup finished in 45 secs, allocated 3.451 GiB of device memory
6: INFO 08-28 11:12:47 habana_executor.py:91] init_cache_engine took 46.45 GiB of device memory (61.43 GiB/94.62 GiB used) and 2.484 GiB of host memory (61.33 GiB/1007 GiB used)
6: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=128, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
6: Warming up...
Warmup iterations: 100%|██████████| 5/5 [01:08<00:00, 13.71s/it]
Profiling iterations: 100%|██████████| 10/10 [02:16<00:00, 13.64s/it]
6: E2E Throughput: 1200.877 tokens/sec.
```
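The captured bucket lists follow from the `(min, step, max_warmup)` triples. A minimal sketch of the expansion as I understand it from these logs — ramp up by powers of two below the step, then by multiples of the step; the exact `habana_model_runner.py` implementation may differ, and the bs × seq cross-product is filtered further, which is why only 18 of 64 prompt buckets (28.1%) are captured in the second run below:

```python
import itertools
import operator

def warmup_range(config):
    # Expand a (min, step, max_warmup) triple into the values that are
    # warmed up: powers of two from min while below step, then multiples
    # of step up to max_warmup. (Sketch inferred from the log output.)
    bmin, bstep, bmax = config
    ramp_up = itertools.accumulate(itertools.repeat(2),
                                   func=operator.mul, initial=bmin)
    ramp_up = itertools.takewhile(lambda x: x < bstep and x <= bmax, ramp_up)
    stable = range(bstep, bmax + 1, bstep)
    return list(ramp_up) + list(stable)

print(warmup_range((1, 1, 1)))        # [1] -> the single (1, 1) prompt bucket
print(warmup_range((1, 64, 128)))     # [1, 2, 4, 8, 16, 32, 64, 128]
print(warmup_range((128, 128, 1024))) # [128, 256, ..., 1024]
```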
Prompt bucket config (min, step, max_warmup) bs:[1, 64, 128], seq:[128, 128, 1024]:
```
6: INFO 08-28 11:25:56 habana_model_runner.py:1128] Graph/Prompt captured:18 (28.1%) used_mem:19.79 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (4, 128), (4, 256), (4, 384), (8, 128)]
6: INFO 08-28 11:25:56 habana_model_runner.py:1128] Graph/Decode captured:72 (100.0%) used_mem:3.239 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (8, 128), (8, 256), (8, 384), (8, 512), (8, 640), (8, 768), (8, 896), (8, 1024), (8, 1152), (16, 128), (16, 256), (16, 384), (16, 512), (16, 640), (16, 768), (16, 896), (16, 1024), (16, 1152), (32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (32, 896), (32, 1024), (32, 1152), (64, 128), (64, 256), (64, 384), (64, 512), (64, 640), (64, 768), (64, 896), (64, 1024), (64, 1152), (128, 128), (128, 256), (128, 384), (128, 512), (128, 640), (128, 768), (128, 896), (128, 1024), (128, 1152)]
6: INFO 08-28 11:25:56 habana_model_runner.py:1206] Warmup finished in 146 secs, allocated 23.03 GiB of device memory
6: INFO 08-28 11:25:56 habana_executor.py:91] init_cache_engine took 61.18 GiB of device memory (85.15 GiB/94.62 GiB used) and 2.813 GiB of host memory (61.55 GiB/1007 GiB used)
6: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=128, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
6: Warming up...
Warmup iterations: 100%|██████████| 5/5 [01:07<00:00, 13.57s/it]
Profiling iterations: 100%|██████████| 10/10 [02:16<00:00, 13.61s/it]
6: E2E Throughput: 1203.730 tokens/sec.
```
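For reference, the two reported throughputs differ by less than a quarter of a percent:

```python
# Relative difference between the two E2E throughputs reported above.
base, alt = 1200.877, 1203.730  # tokens/sec for the two bucket configs
print(f"{(alt - base) / base:.2%}")  # 0.24% -- effectively no change
```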