Closed ZZR0 closed 1 year ago
Thanks for your interest in LMFlow! For finetuning decoder-only models, we normally concatenate questions and answers together and ask the model to reproduce the whole sequence, so this label is necessary. For evaluation, the label is indeed unnecessary: as shown in our code src/lmflow/pipeline/evaluator.py, no such group_texts procedure is invoked there.
Hope that answers your question 😄
Thanks for your reply. I think result["labels"] is only used in the loss calculation, so if we mask the question ids in result["labels"] with -100, the model can still reproduce the whole sequence. Moreover, I checked other open-source LLM fine-tuning projects, and they do mask the question ids in result["labels"] with -100 (e.g., https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py#L133, https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train.py#L143). I want to know whether this setting affects model performance; masking the question ids in result["labels"] with -100 seems more reasonable, because we care about the answers rather than the questions.
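For concreteness, the masking those projects apply can be sketched roughly like this (a minimal illustration, not their exact code; the token ids and the `build_labels` helper are made up for the example):

```python
# Sketch of label masking for decoder-only finetuning.
# -100 is PyTorch CrossEntropyLoss's default ignore_index,
# so masked positions contribute nothing to the loss.
IGNORE_INDEX = -100

def build_labels(prompt_ids, answer_ids):
    """Concatenate prompt and answer; mask the prompt in the labels."""
    input_ids = list(prompt_ids) + list(answer_ids)
    # The model still sees (and learns to condition on) the prompt tokens,
    # but the loss is computed only on the answer tokens.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels

input_ids, labels = build_labels([101, 7592, 102], [2023, 2003])
# input_ids -> [101, 7592, 102, 2023, 2003]
# labels    -> [-100, -100, -100, 2023, 2003]
```

This keeps result["input_ids"] unchanged, so teacher forcing over the full sequence still works; only the supervision signal is restricted to the answer span.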
Hi, thanks for your great suggestion! I think your point is important and may significantly affect performance. Let's conduct more experiments and compare the results. Would you like to contribute by creating a fork and opening a PR? (We can merge it after some comparisons.)
It seems we need to refactor the dataset preprocessing to record the question length and answer length, which may require reformatting the dataset, so it would be more appropriate for you to do that refactoring. I look forward to your updates and experimental results.
Thanks for your suggestions! We actually considered this problem during the architecture design. In that case, users simply provide "text2text"-typed data for decoder-only models, and the question/answer lengths will then be available to the preprocessing procedure. This feature is under implementation and we will let you know as soon as it is available. Thanks 😄
Hi,
We now support text2text training by masking the question ids in result["labels"] with -100. Please convert your data to the text2text format.
Thank you for your great suggestion!
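A text2text dataset file might look roughly like the following; note that the exact schema (the "type", "instances", and "input"/"output" keys) is an assumption based on this thread, so please check the repository's data documentation for the authoritative format:

```python
import json

# Hypothetical text2text-style dataset: each instance pairs a question
# ("input") with an answer ("output"); only the output receives loss.
dataset = {
    "type": "text2text",
    "instances": [
        {"input": "Question: What is 2 + 2?", "output": "Answer: 4"},
        {"input": "Question: Name a prime number.", "output": "Answer: 7"},
    ],
}

# Serialize and round-trip to verify the file is valid JSON.
serialized = json.dumps(dataset, indent=2)
loaded = json.loads(serialized)
```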
The result["labels"] is the same as result["input_ids"], which contains the prompt ids, so the loss function also computes a loss over the prompt ids; I think this is not what we want in a QA scenario. Should we therefore replace the prompt ids in result["labels"] with -100 so that the loss function ignores them?
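To illustrate why -100 achieves this, here is a pure-Python toy version of the masked mean negative log-likelihood (the `masked_nll` helper and the probabilities are invented for the example; PyTorch's CrossEntropyLoss with its default ignore_index=-100 behaves analogously):

```python
import math

IGNORE_INDEX = -100

def masked_nll(log_probs, labels):
    """Mean negative log-likelihood over non-ignored positions only."""
    terms = [-lp[y] for lp, y in zip(log_probs, labels) if y != IGNORE_INDEX]
    return sum(terms) / len(terms)

# Two positions; the first (a prompt token) is masked out.
log_probs = [
    {0: math.log(0.9)},  # prompt position: would contribute if not masked
    {1: math.log(0.5)},  # answer position: contributes -log(0.5)
]
labels = [IGNORE_INDEX, 1]
loss = masked_nll(log_probs, labels)  # equals -log(0.5) = log(2)
```

Because the masked prompt positions are dropped from the average entirely, the optimization target becomes purely the answer tokens, which is the behavior being asked for here.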