Additional info:
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input = tokenizer([prompt], return_tensors="pt").to(model.device)
context_length = input.input_ids.shape[-1]
output = model.generate(
    **input,
    max_new_tokens=max_new_tokens,
    num_beams=1,
    do_sample=True,
    temperature=0.5,
)[0]
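Since the failure mode is that generation never stops, one thing worth checking is whether the EOS / stop token IDs are actually wired up in the tokenizer and generation config. A minimal sketch, assuming the local AWQ checkpoint path from the vLLM config below:

```python
from transformers import AutoTokenizer, GenerationConfig

# Path assumed from the vLLM launch config below; adjust as needed.
model_path = "/data/models/Qwen2.5-14B-Instruct-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_path)
gen_config = GenerationConfig.from_pretrained(model_path)

# Qwen2.5 chat models are expected to stop on <|im_end|>; if its ID is not in
# eos_token_id, generate() will keep going until max_new_tokens is exhausted.
print("tokenizer eos_token:", tokenizer.eos_token, tokenizer.eos_token_id)
print("generation_config eos_token_id:", gen_config.eos_token_id)
print("<|im_end|> id:", tokenizer.convert_tokens_to_ids("<|im_end|>"))
```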
vLLM server launch command:
command:
- --model
- /data/models/Qwen2.5-14B-Instruct-AWQ
- --served-model-name
- Qwen2.5-14B-Instruct
- --max-model-len
- '32768'
- --max-seq-len-to-capture
- '32768'
- --enable-chunked-prefill
- --max-num-batched-tokens
- '4096'
- --use-v2-block-manager
- --kv-cache-dtype
- 'fp8'
- --enable-auto-tool-choice
- --tool-call-parser
- hermes
- --disable-log-requests
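For completeness, the badcase can also be reproduced against this server through its OpenAI-compatible endpoint. The sketch below is only illustrative: the host/port and prompt are placeholders, the model name matches --served-model-name above, and the max_tokens cap is an assumed value.

```python
from openai import OpenAI

# Placeholder endpoint; point this at wherever the vLLM container is exposed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen2.5-14B-Instruct",  # matches --served-model-name above
    messages=[{"role": "user", "content": "..."}],  # placeholder; the actual LongBench-Write prompt is in the Poe link below
    temperature=0.5,
    max_tokens=4096,  # assumed cap; in the badcase, generation runs until this limit
)
print(completion.choices[0].message.content)
```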
Model Series
Qwen2.5
What are the models used?
Qwen2.5-14B-Instruct-AWQ (and possibly the original, non-quantized Qwen2.5-14B-Instruct)
What is the scenario where the problem happened?
Qwen2.5-14B fails to stop generating with either transformers or vLLM.
Is this badcase known and can it be solved using available techniques?
Information about environment
NVIDIA driver 560.28.03, CUDA 12.1, 4× L40S
Description
I am trying to evaluate the Qwen2.5 models' long-generation capability using LongBench-Write, proposed by the GLM team (https://github.com/THUDM/LongWriter#evaluation).
Badcase prompts:
Poe share link: https://poe.com/s/LRQaLwIRJchk4fvl3ey0
Local generation result with vLLM: