OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

Do we need to replace the prompt ids in the result["labels"] with -100? #163

Closed ZZR0 closed 1 year ago

ZZR0 commented 1 year ago
from itertools import chain

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model
    # supported it instead of this drop, you can customize this part to
    # your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

The result["labels"] is the same as result["input_ids"] which contain the prompt ids, so the loss function will compute the loss of prompt ids, and I think this is no what we want in QA scenario. Therefore, do we need to replace the prompt ids in the result["labels"] with -100 to make loss function ignore the loss of prompt ids?

research4pan commented 1 year ago

Thanks for your interest in LMFlow! For finetuning decoder-only models, we normally concatenate questions & answers together and ask the model to reproduce the whole sequence, so this label is necessary. For evaluation, this label is indeed unnecessary; as shown in our code src/lmflow/pipeline/evaluator.py, no such group_texts procedure is invoked there.

Hope that answers your question 😄

ZZR0 commented 1 year ago

Thanks for your reply. I think result["labels"] is only used in the loss calculation, so if we mask the question ids in result["labels"] with -100, the model still conditions on the whole sequence; only the loss on the question tokens is skipped. Moreover, I checked other open-source LLM fine-tuning projects, and they do mask the question ids in result["labels"] with -100 (e.g., https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py#L133, https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train.py#L143). I want to know whether this setting affects model performance; masking the question ids in result["labels"] with -100 seems more reasonable because we care about the answers rather than the questions.
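(An illustrative sketch in the spirit of the masking those projects do, assuming a Hugging Face tokenizer; IGNORE_INDEX and preprocess_example are hypothetical names used here for illustration, not LMFlow or Alpaca/FastChat APIs.)

import copy

IGNORE_INDEX = -100

def preprocess_example(question, answer, tokenizer):
    # Tokenize the concatenated sequence and the question alone.
    full = tokenizer(question + answer)
    prompt = tokenizer(question)

    input_ids = full["input_ids"]
    labels = copy.deepcopy(input_ids)

    # Mask the question tokens so the loss is computed on the answer only.
    # A real implementation may need to account for special tokens the
    # tokenizer adds around the question.
    prompt_len = len(prompt["input_ids"])
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len

    return {"input_ids": input_ids, "labels": labels}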

shizhediao commented 1 year ago

Hi, thanks for your great suggestion! I think your comment is critical and may significantly affect performance. Let's conduct more experiments and compare the results. Would you like to contribute by creating a fork and opening a PR? (We can merge it after some comparisons.)

ZZR0 commented 1 year ago

It seems we would need to refactor the dataset preprocessing to record the question length and answer length, which may also require reformatting the dataset. It would therefore be more appropriate for you to do the refactoring; I look forward to your updates and experimental results.

research4pan commented 1 year ago

Thanks for your suggestions! We actually considered this problem during the architecture design. In that case, users simply provide "text2text"-typed data for decoder-only models, and the question/answer lengths are then available to the preprocessing procedure. This feature is under implementation, and we will let you know as soon as it is available. Thanks 😄

shizhediao commented 1 year ago

Hi, we now support text2text training by masking the question ids in result["labels"] with -100. Please convert your data to the text2text format. Thank you for your great suggestion!
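(For anyone landing here later, a hypothetical sketch of converting a QA pair into a text2text-style dataset file; the "type"/"instances"/"input"/"output" keys are an assumption of this sketch, so please check LMFlow's data format documentation for the exact schema.)

import json

dataset = {
    "type": "text2text",          # assumed dataset type name, verify against the docs
    "instances": [
        {
            "input": "Question: What is the capital of France?\nAnswer:",
            "output": " Paris.",
        }
    ],
}

with open("train.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, indent=2, ensure_ascii=False)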