Closed · wpf19911118 closed this 6 months ago
Is this latency issue also occurring with the upstream version?
Due to a prolonged period of inactivity, this issue will be closed. If the matter arises again or if there's further interest in pursuing this topic, please feel free to reopen the issue or create a new one. Thank you!
Environment: embeddedllminfo/vllm-rocm:vllm-v0.2.1.post1
Path: /app/vllm-rocm/benchmarks
Command:

```shell
python benchmark_latency.py --model /var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/ --input-len 512 --output-len 32 --batch-size 1 --n 1 --num-iters 100
```
Screen results:

```
root@a7:/app/vllm-rocm/benchmarks# python benchmark_latency.py --model /var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/ --input-len 512 --output-len 32 --batch-size 1 --n 1 --num-iters 100
Namespace(model='/var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/', tokenizer=None, quantization=None, tensor_parallel_size=1, input_len=512, output_len=32, batch_size=1, n=1, use_beam_search=False, num_iters=100, trust_remote_code=False, dtype='auto')
INFO 11-14 04:33:49 llm_engine.py:72] Initializing an LLM engine with config: model='/var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/', tokenizer='/var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 11-14 04:33:49 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
WARNING[XFORMERS]: Need to compile C++ extensions to use all xFormers features. Please install xformers properly (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available. Set XFORMERS_MORE_DETAILS=1 for more details
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
INFO 11-14 04:33:58 llm_engine.py:207] # GPU blocks: 5756, # CPU blocks: 512
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], ignore_eos=True, max_tokens=32, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True)
Warming up...
Profiling iterations: 100%|████████████████████| 100/100 [00:07<00:00, 12.71it/s]
Avg latency: 0.07859424050606321 seconds
```
The latency looks abnormally low. I tried changing the output length, but got nearly identical results. I wonder if only a single token is actually being generated?
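A quick back-of-the-envelope check supports that suspicion (a minimal sketch; the 0.0786 s average latency and the 32-token output length are taken from the benchmark run above):

```python
# Sanity-check the reported latency: if all 32 output tokens were really
# generated each iteration, the implied per-token decode time would be
# implausibly fast for a 7B bf16 model on a single MI210.
avg_latency_s = 0.07859424050606321  # "Avg latency" from the benchmark output
output_len = 32                      # --output-len from the command line

per_token_ms = avg_latency_s / output_len * 1000
print(f"implied per-token latency: {per_token_ms:.2f} ms")  # ~2.46 ms/token

# If only one token were generated per iteration, the per-token time
# would instead be ~78.6 ms, a far more plausible decode figure.
print(f"latency if only 1 token generated: {avg_latency_s * 1000:.1f} ms")
```

Roughly 2.5 ms per decoded token would be far below what a single MI210 can plausibly sustain for a 7B model, whereas ~79 ms for a single token is in a believable range, so the benchmark may effectively be measuring one decode step.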