Open Storm0921 opened 3 weeks ago
Hi, how did you start that script? Did you change the script to set the temperature?
Modify the temperature value at this line: https://github.com/vllm-project/vllm/blob/main/benchmarks/backend_request_func.py#L321. Then just run benchmark_serving.py directly with python, changing the address, model, and so on in the args.
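The edit described above amounts to changing the sampling fields in the request payload that `backend_request_func.py` builds. A minimal sketch of that payload, assuming the OpenAI-style completions format vLLM serves (the field values here are illustrative, not the script's actual defaults):

```python
# Sketch of the sampling fields in an OpenAI-style completions payload,
# as built by vLLM's benchmark backend_request_func.py. Changing
# "temperature" here is the edit referred to above; all values are
# illustrative assumptions.
def build_payload(model: str, prompt: str, max_tokens: int) -> dict:
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,  # the stock benchmark script pins this value
        "stream": True,
    }

payload = build_payload("Qwen2-72B-Instruct", "开始", 2048)
```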
Hi, if you have modified the script and would like to receive coherent responses, you probably also want to modify the repetition penalty, stop tokens, endpoint (use chat completions) and such.
Could you take some time to actually try it? I set repetition_penalty=1.0, stop=["<|im_end|>", "<|endoftext|>"], and the endpoint is openai-chat. Qwen2-72B-gptq-int4 still repeats like that, but with the same settings Qwen1.5-72B-gptq-int4 does not repeat.
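For reference, the settings in the comment above map onto an OpenAI-style chat completions request roughly as follows. This is a sketch: `repetition_penalty` is a vLLM extension to the standard OpenAI fields, and the values are copied from the comment.

```python
# Sketch of a chat completions request body with the settings from
# the comment above: repetition_penalty=1.0, explicit Qwen stop
# tokens, and the chat endpoint (/v1/chat/completions).
# repetition_penalty is a vLLM-specific extra parameter.
def build_chat_request(model: str, user_message: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": 0.7,
        "top_p": 0.8,
        "repetition_penalty": 1.0,
        "stop": ["<|im_end|>", "<|endoftext|>"],
        "max_tokens": 2048,
    }

req = build_chat_request("Qwen2-72B-Instruct-GPTQ-Int4", "开始")
```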
Hi, I had tested the whole thing before my first comment.
But have you not experienced any repetition yourself? I see many others, just like me, running into repetition. What's going on...
Hi, please share the cases so that we can try to reproduce.
@jklj077
Hi, the files you provided were heavily modified and a lot of things were hard-coded. After the backend kept giving me Bad Request, I just gave up.
But your request appears to be just "开始" ("start"), and that's the case I need. I can reproduce the issue with vllm==0.5.0.post1 and default generation parameters (temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05). About 10% of the responses are highly repetitive. It did not happen with "结束" ("end"), though.
Do you happen to have any other real world cases that can be shared?
When I previously used some conversation data for benchmark testing, I ran into occasional repetition that I did not record. I suggest you run more conversations using some open-source Chinese datasets, which should reproduce it. Thanks.
I also ran into this problem: during inference the model repeats its answer until it exceeds max_model_len.
@hqh312 Hi, which model, which framework, and are there any cases you could share?
I also hit this problem when using dify. I deployed vLLM as an OpenAI-compatible endpoint and used it as an LLM node in dify; the problem occurs occasionally. Strangely, when I ask the LLM the same thing directly, there is no problem.
Model: Qwen2-72B-Instruct-AWQ
Framework: vllm
Example:
Input: a sensitive-topic prompt
Output: it does not refuse outright; it gives a long-winded explanation and then loops repetitively
Overall, Qwen2's answers are much longer than Qwen1.5's.
Machine: A800, vLLM 0.5.0; the prompt is "开始", output max tokens=2048, temperature=0.7.
Loading Qwen2-72B-Instruct-gptq-int4 with vLLM and running vLLM's benchmark script for concurrency testing, the output repeats whether concurrency is capped at 1 or at 10. https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py
Of course, I also tested with unlimited concurrency, and it likewise produced repetition.
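A quick way to flag the kind of runaway repetition reported throughout this thread is to check whether any character n-gram keeps recurring in a response. A minimal, hypothetical checker (not part of vLLM or the benchmark script):

```python
from collections import Counter

def is_highly_repetitive(text: str, n: int = 8, threshold: int = 5) -> bool:
    """Return True if some n-character substring occurs at least
    `threshold` times -- a crude signal of looped model output."""
    if len(text) < n:
        return False
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return max(counts.values()) >= threshold

# A looped answer trips the check; short normal prose does not.
looped = "抱歉,我不能回答这个问题。" * 20
print(is_highly_repetitive(looped))              # True
print(is_highly_repetitive("你好,今天天气不错。"))  # False
```

Running this over the benchmark responses would make it easy to quantify the "about 10% highly repetitive" observation above instead of eyeballing the outputs.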