OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

Conversation_template #917


showgood880702 commented 1 week ago

Could you tell me how to use this conversation_template in the chatbot? I used a training dataset that follows the Llama-3 conversation_template, but there doesn't seem to be an argument for setting a conversation_template in chatbot.py. Should I use --prompt_structure to pass the Llama-3 template instead?

Also, when training Llama-3, should my dataset always follow its conversation_template?

Thank you so much.

wheresmyhair commented 1 week ago

Hi, first of all, thanks for your interest in LMFlow! Regarding your questions:

  1. conversation_template only takes effect during model training (finetuning) with a conversation dataset (i.e., "type": "conversation" in the .json file). It is responsible for adding the special tokens, so you don't need to add them yourself for each different model. See here for a dataset example (a rough sketch of the format also follows this list), or you could

    cd data
    bash download.sh alpaca

    and take the json file in train_conversation as a reference.

  2. For inference, you may try the following code, taken from the Llama HF repo, as a temporary workaround:

    
    import torch
    from transformers import pipeline

    model_id = "meta-llama/Llama-3.2-1B-Instruct"
    pipe = pipeline(
        "text-generation",
        model=model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    messages = [
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ]
    outputs = pipe(
        messages,
        max_new_tokens=256,
    )
    print(outputs[0]["generated_text"][-1])
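
Coming back to point 1: for reference, here is a rough sketch of what a conversation-type file looks like. This is sketched from memory of the LMFlow docs, so please double-check the field names against the alpaca example downloaded above:

{
  "type": "conversation",
  "instances": [
    {
      "conversation_id": 1,
      "system": "You are a chatbot developed by LMFlow team.",
      "tools": [],
      "messages": [
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I am a chatbot developed by LMFlow team."}
      ]
    }
  ]
}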


`chatbot.py` is outdated and we're planning to upgrade it. For now, it is not compatible with instruction/chat models. Sorry for the inconvenience.
showgood880702 commented 6 days ago

Thank you for the explanation. However, I'm still a bit confused about the conversation dataset structure. For the training dataset, should I put the already-templated data in as {"type": "text_only", "instances": [...]}? I'm not sure how I'm supposed to fit data into {"type": "conversation", "instances": []} when it has already been run through the conversation template.

wheresmyhair commented 6 days ago

If the data is already templated, you can choose based on the expected behavior. The reason we designed the conversation dataset type is that we want to not only do the tokenization and templating, but also mask the user inputs, system prompts, and tool information: the model sees them all at once, so there is no need to generate them autoregressively. In other words, you do not need train_on_prompt. The conversation dataset also supports multi-round conversations; for a conversation with two rounds, the mask will look like [1,1,1,1,0,0,0,1,1,1,0,0,0] (see the sketch below).
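
To make that concrete, here is a minimal illustrative sketch (not LMFlow's actual implementation) of how such a mask can be turned into training labels, assuming the usual Hugging Face convention of -100 for positions that carry no loss, and reading a 1 in the mask above as "masked prompt token":

IGNORE_INDEX = -100  # label value that Hugging Face cross-entropy ignores

def apply_prompt_mask(input_ids, mask):
    """Replace masked (prompt) positions with IGNORE_INDEX so they contribute no loss."""
    assert len(input_ids) == len(mask)
    return [IGNORE_INDEX if m == 1 else tok for tok, m in zip(input_ids, mask)]

# Toy two-round conversation: system + user1 masked, assistant1 trained,
# user2 masked, assistant2 trained (matches the mask above).
input_ids = [10, 11, 12, 13, 20, 21, 22, 30, 31, 32, 40, 41, 42]
mask      = [1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
labels = apply_prompt_mask(input_ids, mask)
# labels == [-100, -100, -100, -100, 20, 21, 22, -100, -100, -100, 40, 41, 42]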

You can use the text_only dataset type if you've already organized each conversation into a single string. The json file should then look like:

{
  "type": "text_only",
  "instances": [
    {"text": "<|begin_of_text|>\n\n<|start_header_id|>system<|end_header_id|>\n\nYou are a chatbot developed by LMFlow team.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI am a chatbot developed by LMFlow team.<|eot_id|>"},
    {"text": "SOME_OTHER_TEMPLATED_TEXT_2"},
    {"text": "SOME_OTHER_TEMPLATED_TEXT_3"}
  ]
}

However, we cannot mask the prompt in this case, since it is extremely hard to parse out which tokens should be masked. In other words, you do train_on_prompt.

Alternatively, the text2text dataset type masks everything in input. For a single-round conversation this should be fine (there is no difference between a properly templated text2text dataset and a conversation dataset once you set conversation_template correctly):

{
  "type": "text2text",
  "instances": [
    {
      "input": "<|begin_of_text|>\n\n<|start_header_id|>system<|end_header_id|>\n\nYou are a chatbot developed by LMFlow team.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
      "output": "I am a chatbot developed by LMFlow team.<|eot_id|>"
    },
    {
      "input": "SAMPLE_INPUT_2",
      "output": "SAMPLE_OUTPUT_2"
    },
    {
      "input": "SAMPLE_INPUT_3",
      "output": "SAMPLE_OUTPUT_3"
    }
  ]
}
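
As a side note, if you'd rather not hand-write these templated strings, the Hugging Face tokenizer's apply_chat_template can generate them from role/content messages. Here is a minimal sketch for producing a text2text instance; the model id is just an example (access to the gated Llama repo is required), and any tokenizer shipping a Llama-3 chat template would behave the same way:

from transformers import AutoTokenizer

# Example model id; requires access to the gated meta-llama repo.
model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a chatbot developed by LMFlow team."},
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am a chatbot developed by LMFlow team."},
]

# Everything up to and including the assistant header becomes "input"...
input_text = tokenizer.apply_chat_template(
    messages[:-1], tokenize=False, add_generation_prompt=True
)
# ...and the assistant reply plus the end-of-turn token becomes "output".
output_text = messages[-1]["content"] + "<|eot_id|>"

instance = {"input": input_text, "output": output_text}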