QwenLM / Qwen2.5

Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud.

After deploying Qwen1.5-14B with vLLM, the model's reasoning ability drops sharply #462

Closed · xqxls closed this issue 2 months ago

xqxls commented 5 months ago

Here is the prompt I entered: I want you to act as the interviewer for a Java developer position, and I will answer the questions as the candidate. Please follow these requirements: 1. You are responsible for asking interview questions for the Java developer position. 2. Ask only one question at a time. 3. Only ask questions; do not explain them and do not answer them. Please ask the first question.

Here is the reply the model gave: Please briefly describe how Java's garbage collection (Garbage Collection) mechanism works and how you have optimized memory usage in your projects? Next question. In Java, how do you implement communication between threads, especially in multi-threaded concurrent programming? Please explain Java's exception handling model, including the role of try-catch-finally blocks? Describe how you would implement a thread-safe collection in Java and give a concrete example? When designing a high-concurrency system, how do you choose between synchronization primitives (synchronized) and concurrent collections (Concurrent Collections)? Please give your reasons. Please briefly describe the core role of the Spring framework in Java web development and how it improves development efficiency? When working with big data (such as Hadoop MapReduce), which Java features or libraries are most helpful to you? Please give a relevant scenario. Please share a concrete case where you solved a Java performance problem: how did you analyze and optimize it? When doing network programming in Java, how do you handle exceptions on Socket connections? Please describe a common exception handling strategy.

I tried this ten times, and in five of them the model failed to follow the prompt like this. When the model is not deployed with vLLM, it almost always understands the prompt correctly.

jklj077 commented 5 months ago

Hi, please provide steps to reproduce.

haibincoder commented 5 months ago

I ran into the same problem. For a model fine-tuned with PEFT, the results from the vLLM deployment and from a FastAPI + Transformers deployment differ a lot, and I don't know which parameter is wrong.

vllm:

CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen1.5-14B-Chat \
    --port 8010 \
    --max-model-len 2048 \
    --tensor-parallel-size 2 \
    --enable-lora \
    --lora-modules sql-lora=/data/shared_models/14B-subCME-seed0523-rnd10-frac0.7-rank8-epoch1 \
    --gpu-memory-utilization 0.8

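For reference, a minimal sketch of a request against the server started by this command; the port and the sql-lora model name come from the command above, while the temperature and max_tokens values are assumptions chosen to mirror the Transformers settings further down. vLLM's OpenAI-compatible server takes sampling parameters per request, so pinning them explicitly makes the two deployments easier to compare.

from openai import OpenAI

# Endpoint and model name follow the serving command above; adjust to your setup.
client = OpenAI(base_url="http://localhost:8010/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="sql-lora",  # the LoRA module registered via --lora-modules
    messages=[{"role": "user", "content": "Replace this with the actual test prompt"}],
    temperature=0.1,   # assumed value, mirroring the transformers generation_config below
    max_tokens=512,
)
print(completion.choices[0].message.content)
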
transformers:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name and messages are defined elsewhere in the original script
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.generation_config.max_new_tokens = 2048
model.generation_config.temperature = 0.1

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to("cuda")

# Note: max_new_tokens=512 here overrides the 2048 set on generation_config above
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
)
# Strip the prompt tokens so only the newly generated continuation is decoded
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

I have tried the following approaches without solving it: https://github.com/01-ai/Yi-1.5/issues/7 https://zhuanlan.zhihu.com/p/658780653

github-actions[bot] commented 4 months ago

This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.

2212168851 commented 4 months ago

I encountered the same problem, please help.

jklj077 commented 4 months ago

Please check whether the generation configuration is consistent, including temperature, top_p, top_k, repetition_penalty, and so on. In addition, consider merging the LoRA adapters when comparing results. If you have cases that can be shared, please share them so that we can look into it.
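
As a sketch of the adapter-merging suggestion (assuming the adapter path from the serving command above; the output directory name is made up for the example), merging with PEFT before serving could look like this:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "Qwen/Qwen1.5-14B-Chat"
adapter_path = "/data/shared_models/14B-subCME-seed0523-rnd10-frac0.7-rank8-epoch1"
output_dir = "./qwen1.5-14b-chat-merged"  # hypothetical output location

# Load the base model and attach the LoRA adapter on top of it
base = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_path)

# Fold the adapter weights into the base weights and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained(output_dir)
AutoTokenizer.from_pretrained(base_model_name).save_pretrained(output_dir)

The merged checkpoint can then be served directly (for example, with --model pointing at the merged directory and without --enable-lora), so that any remaining difference between the two deployments is more likely to come from the sampling configuration than from the LoRA path.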

github-actions[bot] commented 3 months ago

This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.