Timothy023 / RLMEC

The official repository of "Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint"

Question about the SFT method #3

Closed YJiangcm closed 3 months ago

YJiangcm commented 3 months ago

Thanks for your excellent work!

In Table 3 of your paper, I wonder what is the difference between "SFT LLM" and "+ SFT"?

[Screenshot of Table 3 from the paper]

Looking forward to your reply.

Timothy023 commented 3 months ago

Thanks for your interest in our work.

+ SFT denotes utilizing the SFT data to further fine-tune the SFT LLM.

Hope this helps.

YJiangcm commented 3 months ago

Thanks for your reply. So SFT LLM means llama2-7b-chat; + SFT means further fine-tuning llama2-7b-chat on the QA or math data; and the methods below, such as +RFT and +DPO, also mean further training based on llama2-7b-chat.

Is my understanding correct?

Timothy023 commented 3 months ago

Your understanding of SFT LLM differs slightly from our experimental setup.

For the QA task, SFT LLM denotes fine-tuning llama2-7b on a mixture of the ECQA and QASC training sets (i.e., process_data/Gen_Samples/data/qa.jsonl) so that it adapts to the QA task.

+SFT, +RFT, and +DPO mean further training based on this SFT LLM, which matches your understanding.

Hope this answers your questions.
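To make the two stages concrete, here is a minimal sketch of how both stages would read the same file. This is not the repo's actual training script; it only assumes that qa.jsonl follows the usual JSON Lines layout (one JSON object per line) and restates the setup described above as comments.

```python
import json

def load_qa_mixture(path="process_data/Gen_Samples/data/qa.jsonl"):
    """Load the mixed ECQA + QASC training set (assumed: one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

qa_data = load_qa_mixture()

# Stage 1 ("SFT LLM"): fine-tune the base llama2-7b on qa_data so it adapts
#                      to the QA task.
# Stage 2 ("+ SFT"):   take the Stage-1 checkpoint and fine-tune it again on
#                      the same qa_data; +RFT and +DPO likewise start from
#                      the Stage-1 checkpoint.
```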

YJiangcm commented 3 months ago

Thanks. My last question: since the SFT LLM is already trained on the ECQA and QASC training sets, what is the training data for +SFT?

Timothy023 commented 3 months ago

Also the ECQA and QASC training sets.