Closed YJiangcm closed 5 months ago
Thanks for your interest in our work.
+ SFT denotes using the SFT data to further fine-tune the SFT LLM.
Hope my answer can help you.
Thanks for your reply. So the SFT LLM means llama2-7b-chat; + SFT means further fine-tuning llama2-7b-chat using QA or math data; and the methods below it, such as +RFT and +DPO, mean further training based on llama2-7b-chat.
Is my understanding correct?
There might be some difference between your understanding and our experiments regarding the meaning of SFT LLM.
In the QA task, SFT LLM denotes llama2-7b trained on the mixture of the ECQA and QASC training sets (i.e., process_data/Gen_Samples/data/qa.jsonl) to adapt it to the QA task.
+SFT, RFT and DPO mean further training based on the SFT LLM, which is the same as your understanding.
Hope the answer can address your questions.
Thanks. My last question is: since SFT LLM uses ECQA and QASC as the training data, what is the training data of +SFT?
It is also the training set of ECQA and QASC.
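
For readers skimming the thread, the relationships described in the replies for the QA task can be summarized in a small sketch. This is only an illustration: the dictionary layout, the `STAGES` and `lineage` names, and the stage labels are hypothetical and do not come from the repository's code. The training data for RFT and DPO is not stated in this thread, so it is left as `None`.

```python
# Illustrative summary only. The structure and names here are hypothetical,
# not taken from the repository; they restate what the replies describe.
QA_TRAIN = "ECQA + QASC training sets (process_data/Gen_Samples/data/qa.jsonl)"

STAGES = {
    "llama2-7b": {"base": None, "data": None},
    "SFT LLM": {"base": "llama2-7b", "data": QA_TRAIN},
    "+SFT": {"base": "SFT LLM", "data": QA_TRAIN},
    "+RFT": {"base": "SFT LLM", "data": None},  # data not stated in this thread
    "+DPO": {"base": "SFT LLM", "data": None},  # data not stated in this thread
}


def lineage(stage):
    """Return the chain of models from the pretrained base up to `stage`."""
    chain = []
    while stage is not None:
        chain.append(stage)
        stage = STAGES[stage]["base"]
    return list(reversed(chain))


print(" -> ".join(lineage("+SFT")))  # llama2-7b -> SFT LLM -> +SFT
```

Read this way, "+SFT" is a second round of fine-tuning on the same QA data, starting from the SFT LLM rather than from llama2-7b-chat.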
Thanks for your excellent work!
In Table 3 of your paper, what is the difference between "SFT LLM" and "+ SFT"?
Looking forward to your reply.