QwenLM / Qwen2

Qwen2 is the large language model series developed by the Qwen team at Alibaba Cloud.

Qwen2-72B-Instruct-gptq-int4 repetition issue #675

Open Storm0921 opened 3 weeks ago

Storm0921 commented 3 weeks ago

Hardware: A800; vLLM 0.5.0; the prompt is "开始" (begin); output max tokens = 2048; temperature set to 0.7.

vLLM loads Qwen2-72B-Instruct-gptq-int4. I use vLLM's benchmark script for concurrency testing; whether the concurrency limit is 1 or 10, the output is repetitive. https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py
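
For context, a minimal sketch of the kind of request load such a benchmark run generates, assuming the model is served through vLLM's OpenAI-compatible server on the default port. The URL, served model name, and bare asyncio fan-out are illustrative stand-ins; benchmark_serving.py itself also handles dataset sampling, pacing, and metrics:

```python
import asyncio

import aiohttp

API_URL = "http://localhost:8000/v1/completions"  # assumed vLLM server address
MODEL = "Qwen/Qwen2-72B-Instruct-GPTQ-Int4"       # assumed served model name


async def one_request(session: aiohttp.ClientSession) -> str:
    # Mirrors the reported settings: prompt "开始", max_tokens=2048, temperature=0.7.
    payload = {
        "model": MODEL,
        "prompt": "开始",
        "max_tokens": 2048,
        "temperature": 0.7,
    }
    async with session.post(API_URL, json=payload) as resp:
        data = await resp.json()
        return data["choices"][0]["text"]


async def main(concurrency: int = 10) -> None:
    # Fire `concurrency` requests at once, roughly what a concurrency limit of 10 means.
    async with aiohttp.ClientSession() as session:
        texts = await asyncio.gather(*(one_request(session) for _ in range(concurrency)))
    for i, text in enumerate(texts):
        print(f"--- response {i} ---\n{text[:200]}")


asyncio.run(main())
```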

[screenshots: repetitive output from the benchmark runs]

Of course, I also tested with no concurrency limit, and the output is repetitive there as well.

[screenshot: repetitive output with unlimited concurrency]

jklj077 commented 3 weeks ago

Hi, how did you start that script? Did you change the script to set the temperature?

Storm0921 commented 3 weeks ago

> Hi, how did you start that script? Did you change the script to set the temperature?

Modify the temperature value on this line: https://github.com/vllm-project/vllm/blob/main/benchmarks/backend_request_func.py#L321. Then just run benchmark_serving.py directly with python, after changing the server address, model, and so on in the args.
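
For reference, the completions payload built by that function looked roughly like the dictionary below in vLLM 0.5.0 (treat the exact field set as an assumption); the stock script hard-codes the temperature, which is what the edit above changes:

```python
# Runnable stand-in for the payload assembled in backend_request_func.py
# (async_request_openai_completions); literals replace the request_func_input fields.
payload = {
    "model": "Qwen/Qwen2-72B-Instruct-GPTQ-Int4",  # assumed served model name
    "prompt": "开始",
    "temperature": 0.7,  # the stock script pins this to 0.0; changed per the comment above
    "best_of": 1,
    "max_tokens": 2048,
    "stream": True,
}
print(payload)
```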

jklj077 commented 3 weeks ago

Hi, if you have modified the script and would like to receive coherent responses, you probably also want to adjust the repetition penalty, the stop tokens, the endpoint (use chat completions), and so on.
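
Concretely, a hedged sketch of what such a request could look like against the chat completions endpoint; repetition_penalty and stop are passed in the JSON body, which vLLM accepts as extensions to the OpenAI schema (server address and model name are assumptions):

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed vLLM server address
    json={
        "model": "Qwen/Qwen2-72B-Instruct-GPTQ-Int4",  # assumed served model name
        "messages": [{"role": "user", "content": "开始"}],
        "max_tokens": 2048,
        "temperature": 0.7,
        "top_p": 0.8,
        "repetition_penalty": 1.05,  # vLLM-specific sampling extension
        "stop": ["<|im_end|>", "<|endoftext|>"],
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```

The chat completions endpoint also applies the model's chat template server-side, so the stop tokens above match Qwen2's `<|im_end|>` convention.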

Storm0921 commented 3 weeks ago

> Hi, if you have modified the script and would like to receive coherent responses, you probably also want to adjust the repetition penalty, the stop tokens, the endpoint (use chat completions), and so on.

Could you take some time to try it yourself? I set repetition_penalty=1.0, stop=["<|im_end|>", "<|endoftext|>"], and the endpoint is openai-chat. With those settings Qwen2-72B-gptq-int4 repeats like that, but with the same settings Qwen1.5-72B-gptq-int4 does not repeat.

jklj077 commented 2 weeks ago

Hi, I had tested the whole thing before my first comment.

  1. The benchmarking scripts should not be used to check the quality of responses, because they lack many features.
  2. Models are different; you should not expect the same hyperparameters to work across models, let alone across different generations of models.
  3. If you have a specific case where the model cannot stop properly, please share it.

Storm0921 commented 2 weeks ago

> Hi, I had tested the whole thing before my first comment.

But haven't you run into any repetition yourself? I see many others, just like me, experiencing it. What's going on...

jklj077 commented 2 weeks ago

Hi, please share the cases so that we can try to reproduce.

Storm0921 commented 2 weeks ago

vllm_benchmark.zip

@jklj077

jklj077 commented 2 weeks ago

Hi, the files you provided were heavily modified and a lot of things were hard-coded. After the backend kept giving me Bad Request, I just gave up.

But your request appears to be just "开始" (begin), and that's the case I need. I can reproduce the issue with vllm==0.5.0.post1 and the default generation parameters (temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05): about 10% of the responses are highly repetitive. It did not happen with "结束" (end), though.

Do you happen to have any other real world cases that can be shared?
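
A rough way to put a number on that ~10% figure: send the same "开始" prompt repeatedly with those default parameters and flag responses dominated by a recurring chunk. The repetition heuristic, server address, and model name below are assumptions for illustration, not part of the original report:

```python
from collections import Counter

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed server


def looks_repetitive(text: str, window: int = 30, threshold: int = 3) -> bool:
    # Crude heuristic: if any 30-character window occurs 3+ times, call it repetitive.
    counts = Counter(text[i : i + window] for i in range(max(len(text) - window + 1, 0)))
    return any(c >= threshold for c in counts.values())


trials, hits = 50, 0
for _ in range(trials):
    out = client.chat.completions.create(
        model="Qwen/Qwen2-72B-Instruct-GPTQ-Int4",  # assumed served model name
        messages=[{"role": "user", "content": "开始"}],
        max_tokens=2048,
        temperature=0.7,
        top_p=0.8,
        extra_body={"top_k": 20, "repetition_penalty": 1.05},  # vLLM extras
    )
    if looks_repetitive(out.choices[0].message.content):
        hits += 1

print(f"repetitive responses: {hits}/{trials}")
```

The window heuristic is only a proxy for "highly repetitive"; eyeballing the flagged outputs is still needed to confirm degenerate loops.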

Storm0921 commented 2 weeks ago

> Hi, the files you provided were heavily modified and a lot of things were hard-coded. After the backend kept giving me Bad Request, I just gave up.
>
> But your request appears to be just "开始" (begin), and that's the case I need. I can reproduce the issue with vllm==0.5.0.post1 and the default generation parameters (temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05): about 10% of the responses are highly repetitive. It did not happen with "结束" (end), though.
>
> Do you happen to have any other real world cases that can be shared?

When I previously ran benchmark tests with some conversation data, I occasionally hit repetition problems, but I did not record them. I suggest you run more conversations using some open-source Chinese datasets; that should reproduce the issue (see the sketch below). Thanks.
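
One possible shape for such a dataset-driven check, reusing the request pattern from earlier in the thread. The JSONL path and its {"prompt": ...} schema are hypothetical; any open-source Chinese conversation set could be flattened into this form:

```python
import json

import requests

# Hypothetical file: one {"prompt": "..."} object per line of Chinese conversation turns.
with open("chinese_conversations.jsonl", encoding="utf-8") as f:
    prompts = [json.loads(line)["prompt"] for line in f][:200]

for prompt in prompts:
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",  # assumed vLLM server address
        json={
            "model": "Qwen/Qwen2-72B-Instruct-GPTQ-Int4",  # assumed served model name
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 1024,
            "temperature": 0.7,
        },
        timeout=600,
    )
    text = resp.json()["choices"][0]["message"]["content"]
    # Flag responses whose final 40 characters already appeared earlier in the text.
    if len(text) > 80 and text.count(text[-40:]) > 1:
        print("possible repetition:", prompt[:40], "->", text[:120])
```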

hqh312 commented 2 weeks ago

I ran into this problem too: during inference the model keeps repeating its answer until it exceeds max_model_len.

jklj077 commented 2 weeks ago

@hqh312 Hi, which model, which framework, and are there any cases you could share?

stay-leave commented 1 week ago

I also ran into this problem when using Dify. I deployed the model with vLLM as an OpenAI-compatible endpoint and use it as an LLM node in Dify, where the problem occurs sporadically. Strangely, when I ask the LLM the same thing directly, there is no problem.

hqh312 commented 5 days ago

> @hqh312 Hi, which model, which framework, and are there any cases you could share?

Model: Qwen2-72B-Instruct-AWQ
Framework: vllm
Sample input: a prompt on a sensitive topic
Sample output: the model does not refuse outright; it gives a very long-winded explanation and then keeps looping and repeating

Overall, Qwen2's answers are much longer than Qwen1.5's.