Hi, first thanks for your interest in LMFlow! Regarding your questions:
`conversation_template` only works for model training (finetuning) with a conversation dataset (i.e., `"type": "conversation"` in the .json file), and it is responsible for adding the special tokens so that you don't need to add them yourself for different models. See here for a dataset example, or you could

```bash
cd data
bash download.sh alpaca
```

and take the json file in `train_conversation` as a reference.
For inference, you may try the following code, taken from the Llama HF repo, as a temporary workaround:
```python
import torch
from transformers import pipeline

model_id = "meta-llama/Llama-3.2-1B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
```
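If you'd rather load the model yourself instead of using `pipeline`, the same templating can be done with the standard `transformers` chat-template API. This is only a minimal sketch of that approach (plain Hugging Face, not an LMFlow-specific interface):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# apply_chat_template inserts the Llama-3 special tokens (<|start_header_id|>, <|eot_id|>, ...)
# and appends the assistant header so the model generates the reply from there.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```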
The `chatbot.py` is outdated and we're planning to upgrade it. As of now, it is not compatible with instruction/chat models. Sorry for the inconvenience.
Thank you for the explanation. However, I'm still a bit confused about the conversation dataset structure. For the training dataset, should I put the templated dataset as `{"type": "text_only", "instances": conversation_template}`? It confuses me how I'm supposed to put data into `{"type": "conversation", "instances": []}` since it's already a conversation template.
If the data is already templated, you could choose based on the expected behavior.
The reason we designed the `conversation` dataset type is that we want to not only do the tokenization and templating but also mask the user inputs, system prompts, and tool information, since the model sees them all at once and there's no need to generate them autoregressively. In other words, you do not need to `train_on_prompt`. The `conversation` dataset also supports multi-round conversations, and the mask will look like `[1,1,1,1,0,0,0,1,1,1,0,0,0]`, say, for a conversation that has two rounds.
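For reference, a `conversation`-type json file is structured roughly like this (the field names below are illustrative; the downloaded `train_conversation` file is the authoritative reference):

```json
{
  "type": "conversation",
  "instances": [
    {
      "conversation_id": 0,
      "system": "You are a chatbot developed by LMFlow team.",
      "tools": [],
      "messages": [
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I am a chatbot developed by LMFlow team."}
      ]
    }
  ]
}
```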
You can use the `text_only` dataset type if you've already organized your conversation into one string. The json file should then look like:
```json
{
  "type": "text_only",
  "instances": [
    {"text": "<|begin_of_text|>\n\n<|start_header_id|>system<|end_header_id|>\n\nYou are a chatbot developed by LMFlow team.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI am a chatbot developed by LMFlow team.<|eot_id|>"},
    {"text": "SOME_OTHER_TEMPLATED_TEXT_2"},
    {"text": "SOME_OTHER_TEMPLATED_TEXT_3"}
  ]
}
```
However, we cannot mask the prompt in this case, since it is extremely hard to parse out the tokens that should be masked. In other words, you do `train_on_prompt`.
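If you don't want to hand-write those templated strings, one option is to let the tokenizer render them via its chat template (a sketch assuming the Llama-3 tokenizer; the exact whitespace may differ slightly from the example above):

```python
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

conversation = [
    {"role": "system", "content": "You are a chatbot developed by LMFlow team."},
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am a chatbot developed by LMFlow team."},
]

# tokenize=False returns the fully templated string, ready to drop into a text_only instance.
text = tokenizer.apply_chat_template(conversation, tokenize=False)
dataset = {"type": "text_only", "instances": [{"text": text}]}

with open("train_text_only.json", "w") as f:
    json.dump(dataset, f, ensure_ascii=False, indent=2)
```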
Alternatively, the `text2text` dataset type will mask all content in `input`. If it's a single-round conversation, it should be fine (there is no difference between a templated `text2text` dataset and a `conversation` dataset once you set `conversation_template` correctly).
```json
{
  "type": "text2text",
  "instances": [
    {
      "input": "<|begin_of_text|>\n\n<|start_header_id|>system<|end_header_id|>\n\nYou are a chatbot developed by LMFlow team.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
      "output": "I am a chatbot developed by LMFlow team.<|eot_id|>"
    },
    {
      "input": "SAMPLE_INPUT_2",
      "output": "SAMPLE_OUTPUT_2"
    },
    {
      "input": "SAMPLE_INPUT_3",
      "output": "SAMPLE_OUTPUT_3"
    }
  ]
}
```
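If your data already exists as fully templated single-round strings (like the `text_only` example above), one way to turn them into `text2text` pairs is to split right after the assistant header. The helper below is purely illustrative and assumes a Llama-3-style template with exactly one assistant turn at the end:

```python
# Illustrative helper: split a single-round, Llama-3-templated string into a text2text instance.
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"

def to_text2text(templated: str) -> dict:
    # Everything up to and including the assistant header goes into "input" (and is masked);
    # the assistant reply becomes "output".
    cut = templated.rindex(ASSISTANT_HEADER) + len(ASSISTANT_HEADER)
    return {"input": templated[:cut], "output": templated[cut:]}

templated = (
    "<|begin_of_text|>\n\n<|start_header_id|>system<|end_header_id|>\n\n"
    "You are a chatbot developed by LMFlow team.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "I am a chatbot developed by LMFlow team.<|eot_id|>"
)
print(to_text2text(templated))
```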
Could you tell me how to use this `conversation_template` in the chatbot? I used a training dataset that follows the Llama-3 `conversation_template`, but there doesn't seem to be an argument to set this `conversation_template` in `chatbot.py`. Should I use `--prompt_structure` to pass the Llama-3 template as an argument?
FYI, when training on Llama-3, should my dataset always follow its `conversation_template`?
Thank you so much.