aswathn1 opened this issue 5 months ago
Hi ~ I have also been having issues reproducing selfrag-7B; my evaluation results are low compared with the results reported in the paper. Would you share the results you got from fine-tuning Llama-2 7B into selfrag 7B?
I ran their finetuning script without making any changes, using the hyperparameter settings from their scripts, and evaluated on PopQA with retrieval using their pre-computed top-k files. I was only able to get up to 34.28% on PopQA, 69.60% on PubHealth, and a str-em of 28.02 and rg of 35.76 on ASQA. Can you share the results you were able to reproduce? That would be helpful for context.
My results are as follows:

base model: llama2-hf (not llama-2-chat)
epochs: 3
mode: always retrieval (pay attention to the retrieval mode; always retrieval and adaptive retrieval have different performance)
PopQA: 0.546
PubHealth: 0.678
ARC: 0.569

These are lower than the paper but better than your results. Hope this helps~ I also have a question about what looks like incorrect code in finetune.py:
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
Have you tried training a model with the corrected finetune code? How does the fix affect the results?
I did, and it did improve the numbers, but they are still lower than the paper's.
Could you please tell me how to modify the above code in finetune.py to make it correct? I would like to test whether the corrected code can reproduce the results presented in the paper. Thanks a lot.
I found that finetune.py in Self-RAG is adapted from the open-instruct finetuning script.
Hello! Thanks for sharing your great work.
I noticed a discrepancy in the way you set up the learning rate scheduler in finetune.py.
When you calculate:
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
Dividing by the total batch size across multiple GPUs, instead of by gradient_accumulation_steps alone, should give the right number of update steps per epoch. This in turn affects your warmup schedule and your linear decay schedule for the learning rate.

I've also been having issues reproducing your results with a locally fine-tuned Llama-2 7B model using your codebase and settings, compared to your Huggingface checkpoint. Please let me know if you can share any feedback on additional settings needed to reproduce Huggingface-checkpoint-level performance. Thank you.
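For concreteness, here is a minimal sketch of one reading of the calculation described above, assuming len(train_dataloader) counts batches over the whole dataset (i.e. the dataloader is not yet sharded per GPU); the helper name and the example numbers are hypothetical:

```python
import math


def update_steps_per_epoch(num_batches: int,
                           gradient_accumulation_steps: int,
                           num_processes: int = 1) -> int:
    """Optimizer steps per epoch when num_batches counts batches over the
    whole dataset, i.e. the dataloader is not yet sharded per GPU."""
    return math.ceil(num_batches / (gradient_accumulation_steps * num_processes))


# Behaviour of the quoted line (single-process view):
#   math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
# The suggestion in this thread is to also account for the number of GPUs,
# so the warmup and linear-decay lengths are based on true optimizer steps.
steps_original = update_steps_per_epoch(10_000, gradient_accumulation_steps=8)
steps_adjusted = update_steps_per_epoch(10_000, gradient_accumulation_steps=8,
                                        num_processes=4)
print(steps_original, steps_adjusted)  # 1250 313
```

Under this reading, dividing only by gradient_accumulation_steps overestimates the optimizer steps per epoch by a factor of the GPU count, which in turn stretches both the warmup and the linear decay of the learning rate schedule.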