PaddlePaddle / PaddleNLP


[Bug]: Llama3 cannot generate eos_token #8351

Closed · holazzer closed this 5 months ago

holazzer commented 6 months ago

Software environment

- paddlenlp: develop

Duplicate issues

Bug description

Llama3 cannot generate `eos_token`. After it finishes answering, it may pad the remaining length with blank tokens, loop over reserved tokens, or keep generating related but off-topic content. Either way, I cannot get it to emit `eos_token`.

Is the eos_token misconfigured? During generation there is a warning that `model.config` conflicts with `model.generation_config`.
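For reference, the settings involved in that warning can be inspected directly (a diagnostic sketch, assuming the same checkpoint as in the repro below; whether `model.config` carries its own `eos_token_id` may vary by model class):

from paddlenlp.transformers import AutoTokenizer, AutoModelForCausalLM

llm = 'meta-llama/Meta-Llama-3-8B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(llm)
model = AutoModelForCausalLM.from_pretrained(llm, dtype="float16")

# Compare the eos settings the tokenizer and generate() actually see
print("tokenizer.eos_token    :", tokenizer.eos_token)
print("tokenizer.eos_token_id :", tokenizer.eos_token_id)
print("model.config eos       :", model.config.eos_token_id)
print("generation_config eos  :", model.generation_config.eos_token_id)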

Steps & code to reproduce

from paddlenlp.transformers import AutoTokenizer, AutoModelForCausalLM
llm = 'meta-llama/Meta-Llama-3-8B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(llm)
model = AutoModelForCausalLM.from_pretrained(llm, dtype="float16")

def chat_with_llama3(msg, max_length=1024):
    input_features = tokenizer(msg, return_tensors="pd")
    outputs = model.generate(**input_features, max_new_tokens=max_length)
    return tokenizer.batch_decode(outputs[0])[0]

prompt = "User: 写一段关于大模型中的'人机对齐问题'的描述。使用中文回答。回答完请停止输出。\n\n Assistant: "
# prompt = "User: 写一段关于大模型中的'人机对齐问题'的描述。使用中文回答。\n\n Assistant: "
# prompt = "User: 给定GRE写作题目,你需要写作一篇能够得到起码5分的文章。题目: University should require every student to take courses outside their major. STOP after you are done answering. \n\n Assistant: "
# prompt = "User: 现在是2024年的夏天。设计一个线下营销活动,帮助麦当劳公司盈利。使用中文回答。回答完请停止输出。\n\n Assistant: "
out = chat_with_llama3(prompt)

# Inspect the generated tokens one by one
input_features = tokenizer(prompt, return_tensors="pd")
outputs = model.generate(**input_features, max_new_tokens=1024)
out = tokenizer.batch_decode(outputs[0])[0]
for i in outputs[0].numpy().flatten():
    print(i, tokenizer.decode(i))
Output (translated from Chinese):

Here is a description of the 'human-AI alignment problem' in large models:

A large language model is a deep learning model with billions of parameters that can handle complex natural language processing tasks such as translation, text generation, and question answering. However, one of the challenges these models face is the human-machine alignment problem.

The alignment problem refers to the mismatch between human-annotated data and the learning algorithm inside a large model. Because the model has an enormous number of parameters, annotated data cannot cover every possible input, which leads to uncertainty and errors in real applications.

In addition, large models can suffer from other problems such as imbalanced data, biased sample selection, and overfitting, all of which affect performance and reliability. Solving the alignment problem and these other challenges is therefore a key step toward models that handle natural language tasks accurately in practice.

Stop output. Note that this is only a brief description; actually solving the alignment problem requires deeper research and solutions.  END OF OUTPUT.  Please do not output anything else.  Thank you!  👋  Bye! 👋  END OF OUTPUT.  Please do not output anything else.  Thank you!  👋  Bye! 👋  [... the same farewell repeats verbatim until max_new_tokens is exhausted; no eos_token is ever emitted]
holazzer commented 6 months ago

Hi! I have found the problem. When Meta-Llama-3-8B-Instruct generates, the token that ends a turn is a different special token, `<|eot_id|>` (id 128009). However, the tokenizer does not load this special token correctly.

[Screenshot 1: tokenizer_config.json]

[Screenshot 2: the tokenizer actually loaded at runtime]

[Screenshot 3: the example on the Hugging Face model card]
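The mismatch can also be checked from Python (a sketch; the expected ids follow the official Llama 3 vocabulary, where `<|end_of_text|>` is 128001 and `<|eot_id|>` is 128009):

# What the tokenizer reports as eos, vs. what id 128009 decodes to
print(tokenizer.eos_token, tokenizer.eos_token_id)  # expected: <|end_of_text|> 128001
print(tokenizer.decode([128009]))                   # prints <|reserved_special_token_5|> here,
                                                    # but should print <|eot_id|>
# Because the token string is never registered, convert_tokens_to_ids("<|eot_id|>")
# may not resolve either, which is why the raw id 128009 is used below.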

After manually adding 128009 as a terminator, the model stops generating on its own. Below is the McDonald's example.

messages = [
    {"role": "system", "content": "You are an expert at planning marketing events outdoors for small to medium size diners and restaurants. "},
    {"role": "user", "content": "Help a local McDonald restaurant plan a promotion event for the anniversary of Big Mac."},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pd"
)

terminators = [
    tokenizer.eos_token_id,
    # tokenizer.convert_tokens_to_ids("<|eot_id|>")  # does not resolve with the broken tokenizer
    128009,  # <|eot_id|>, hard-coded since the tokenizer cannot look it up
]

outputs = model.generate(
    **input_ids,
    max_new_tokens=1024,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

out = tokenizer.batch_decode(outputs[0])
This plan should help create a fun and engaging event that will drive sales, increase brand loyalty, and generate buzz around the anniversary of the Big Mac.<|reserved_special_token_5|>

Example from the HF model card:

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
Arrrr, me hearty! Me name be Captain Chat, the scurviest pirate chatbot to ever sail the Seven Seas o' the Interwebs! Me and me trusty crew o' code be here to swab the decks o' yer queries and answer yer questions with a pirate's flair! So hoist the colors, me hearty, and let's set sail fer a swashbucklin' good time!<|reserved_special_token_5|>

The `<|reserved_special_token_5|>` shown above should be `<|eot_id|>`.

I am not familiar with how paddlenlp loads its multiple config files, so please find a way to fix this. Thanks 🙏.
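Until the loading is fixed, a session-level workaround is to set the extra terminator once on the generation config instead of passing it to every call (a sketch; it assumes `generation_config.eos_token_id` accepts a list of ids the same way `generate(eos_token_id=...)` does in the snippet above):

# Workaround sketch: make every generate() call also stop on <|eot_id|>.
# 128009 is hard-coded because the tokenizer cannot resolve "<|eot_id|>".
model.generation_config.eos_token_id = [tokenizer.eos_token_id, 128009]
outputs = model.generate(**input_ids, max_new_tokens=1024)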

ZHUI commented 5 months ago

OK, we will take a look.

ZHUI commented 5 months ago

PR submitted: https://github.com/PaddlePaddle/PaddleNLP/pull/8371