EmbeddedLLM / vllm-rocm

vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
https://vllm.readthedocs.io
Apache License 2.0

benchmark-latency test bug??? #11

Closed wpf19911118 closed 6 months ago

wpf19911118 commented 7 months ago

env: embeddedllminfo/vllm-rocm:vllm-v0.2.1.post1
paths: /app/vllm-rocm/benchmarks
scripts: python benchmark_latency.py --model /var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/ --input-len 512 --output-len 32 --batch-size 1 --n 1 --num-iters 100

screen results:

root@a7:/app/vllm-rocm/benchmarks# python benchmark_latency.py --model /var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/ --input-len 512 --output-len 32 --batch-size 1 --n 1 --num-iters 100
Namespace(model='/var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/', tokenizer=None, quantization=None, tensor_parallel_size=1, input_len=512, output_len=32, batch_size=1, n=1, use_beam_search=False, num_iters=100, trust_remote_code=False, dtype='auto')
INFO 11-14 04:33:49 llm_engine.py:72] Initializing an LLM engine with config: model='/var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/', tokenizer='/var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 11-14 04:33:49 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
WARNING[XFORMERS]: Need to compile C++ extensions to use all xFormers features. Please install xformers properly (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available. Set XFORMERS_MORE_DETAILS=1 for more details
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
INFO 11-14 04:33:58 llm_engine.py:207] # GPU blocks: 5756, # CPU blocks: 512
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], ignore_eos=True, max_tokens=32, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True)
Warming up...
Profiling iterations: 100%|████████████████████| 100/100 [00:07<00:00, 12.71it/s]
Avg latency: 0.07859424050606321 seconds
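For reference: if all 32 output tokens were really generated, an average of ~0.0786 s per iteration would correspond to roughly 32 / 0.0786 ≈ 400 generated tokens per second at batch size 1 (ignoring the 512-token prefill), which looks implausibly high for a 7B bf16 model on a single GPU.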

The latency seems abnormally low. I tried modifying the output length, but the results were almost identical. I wonder if only one token is actually being generated?
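One way to check this, independent of benchmark_latency.py, is to call the vLLM Python API directly and count the tokens in the returned completion. A minimal sketch, assuming the standard LLM / SamplingParams interface shipped in this image and the same local model path as above (the prompt is just a placeholder):

from vllm import LLM, SamplingParams

# Same local model path as in the benchmark command above; adjust as needed.
llm = LLM(model="/var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/")

# Mirror the benchmark settings: force 32 output tokens and ignore EOS.
params = SamplingParams(max_tokens=32, ignore_eos=True)

outputs = llm.generate(["Hello, my name is"], params)
for out in outputs:
    completion = out.outputs[0]
    # If only one token were being produced, this length would be 1 instead of 32.
    print(f"generated {len(completion.token_ids)} tokens: {completion.text!r}")

If this reports 32 generated tokens per request, the issue would be in how the benchmark measures or reports latency rather than in generation itself.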

tanpinsiang commented 7 months ago

Is this latency issue also occurring with the upstream version?

tanpinsiang commented 6 months ago

Due to a prolonged period of inactivity, this issue will be closed. If the matter arises again or if there's further interest in pursuing this topic, please feel free to reopen the issue or create a new one. Thank you!