AkariAsai / self-rag

This repository includes the original implementation of SELF-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.
https://selfrag.github.io/
MIT License

Incorrect setup of Learning Rate Scheduler #81

Open · aswathn1 opened this issue 5 months ago

aswathn1 commented 5 months ago

Hello! Thanks for sharing your great work.

I noticed a discrepancy in the way you set up the learning rate scheduler in finetune.py.

When you calculate num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps), the dataloader length should be divided by the total number of batches consumed per optimizer update across all GPUs (i.e., gradient_accumulation_steps multiplied by the number of GPUs), not by gradient_accumulation_steps alone, to get the right number of update steps per epoch. This in turn affects your warmup schedule and also the linear decay schedule of your learning rate.
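For concreteness, here is a minimal sketch of the fix I have in mind (train_dataloader, args, and accelerator are the objects already defined in finetune.py; I'm assuming the dataloader length at this point still reflects per-device batching, i.e., it has not yet been sharded across processes, and that accelerator is the HuggingFace accelerate Accelerator):

```python
import math

# Current code: only divides out gradient accumulation, so on multi-GPU
# runs it over-counts the number of optimizer updates per epoch.
num_update_steps_per_epoch = math.ceil(
    len(train_dataloader) / args.gradient_accumulation_steps
)

# Proposed fix (sketch): each optimizer update consumes
# gradient_accumulation_steps batches on each GPU, so divide by both
# factors.
num_update_steps_per_epoch = math.ceil(
    len(train_dataloader)
    / (args.gradient_accumulation_steps * accelerator.num_processes)
)
```

Since the total number of training steps is derived from this count, correcting it changes both the warmup length and the slope of the linear decay.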

I've also been having issues reproducing your results with a locally fine-tuned Llama-2 7B model trained using your codebase and settings, compared to your Hugging Face checkpoint. Please let me know if you can share any feedback on additional settings needed to reproduce the Hugging Face checkpoint's level of performance. Thank you.

fate-ubw commented 4 months ago

Hi ~ I have also been having issues reproducing selfrag-7B; I got low evaluation results compared with the eval results reported in the paper. Would you share your reproduction results from fine-tuning Llama-2 7B into selfrag-7B?

aswathn1 commented 4 months ago

I ran their fine-tuning script without making any changes, using the hyperparameter settings from their fine-tuning scripts, and evaluated with retrieval using their pre-computed top-k files. I was only able to get up to 34.28% on PopQA, 69.60% on PubHealth, and an str-em of 28.02 and rg of 35.76 on ASQA. Can you share the results you were able to reproduce? That would be helpful for context.

fate-ubw commented 4 months ago

My results are as follows:

- base model: llama2-hf (not llama-2-chat)
- epochs: 3
- mode: always retrieval (pay attention to the retrieval mode at evaluation time; always retrieval and adaptive retrieval give different performance)
- PopQA: 0.546
- PubHealth: 0.678
- ARC: 0.569

This is lower than the paper but better than your results. Hope this helps~ I have a question about the incorrect code in finetune.py:

num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)

Have you tried training a model with the corrected version of the above code? How does the corrected code influence the results?

aswathn1 commented 4 months ago

I did, and it did increase the numbers, but they are still lower than the paper's.

fate-ubw commented 4 months ago

Could you please tell me how to modify the above code in finetune.py to make it correct~ I would like to test whether the corrected code can reproduce the results presented in the paper. Thanks a lot!
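If I understand the earlier comments correctly, the change would be roughly the following. This is just my sketch, not the exact code in finetune.py: I'm assuming the open-instruct-style argument names (args.warmup_ratio, args.lr_scheduler_type, args.num_train_epochs), that an accelerate Accelerator is in scope as accelerator, and that the dataloader length has not yet been divided across processes.

```python
import math
from transformers import get_scheduler

# Count optimizer updates per epoch, accounting for gradient
# accumulation *and* the number of GPUs.
num_update_steps_per_epoch = math.ceil(
    len(train_dataloader)
    / (args.gradient_accumulation_steps * accelerator.num_processes)
)
max_train_steps = args.num_train_epochs * num_update_steps_per_epoch

# With the corrected step count, the warmup phase and the linear decay
# end where they are supposed to.
lr_scheduler = get_scheduler(
    name=args.lr_scheduler_type,
    optimizer=optimizer,
    num_warmup_steps=int(args.warmup_ratio * max_train_steps),
    num_training_steps=max_train_steps,
)
```

Is that the change you made, or did you modify anything else?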

fate-ubw commented 4 months ago

I found that finetune.py in self-rag is revised from the open-instruct finetune script.