Closed: MichaelMtl closed this issue 1 month ago
@MichaelMtl I apologize for the delay in responding. Thank you for supporting our work. For the CodeLlama test, we selected 4 to 8 Python-related samples from the StarCoder data to build the draft model.
OK, thank you for clarifying this!
I was also wondering if you changed the prompt for the LLaMA-2-13B-Chat experiments or just kept it the same as for the LLaMA-2-13B experiments.
@MichaelMtl Do you mean the prompt for the development set of the draft model? If so: for LLaMA-13B, LLaMA-13B-Chat, and LLaMA-70B, since they evaluate the same tasks, their development sets are identical, consisting of 8 items in total: 4 from CNN/DM and 4 from XSum.
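For illustration, a development-set construction like the one described (4 items each from CNN/DM and XSum, 8 in total) might be sketched as below. This is only a sketch: the corpus variables are hypothetical placeholders, not the repository's actual data-loading code.

```python
import random


def build_dev_set(cnn_dm_samples, xsum_samples, per_dataset=4, seed=0):
    """Draw a small development set: `per_dataset` items from each corpus."""
    rng = random.Random(seed)
    return rng.sample(cnn_dm_samples, per_dataset) + rng.sample(xsum_samples, per_dataset)


# Hypothetical placeholder corpora standing in for CNN/DM and XSum articles.
cnn_dm = [f"cnn_article_{i}" for i in range(100)]
xsum = [f"xsum_article_{i}" for i in range(100)]

dev_set = build_dev_set(cnn_dm, xsum)
print(len(dev_set))  # 8
```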
Hi, sorry, I meant the system prompt, in particular what is done in the function `clip_input`. I believe the chat models were fine-tuned with a specific prompt (https://huggingface.co/blog/llama2#how-to-prompt-llama-2), so I was wondering if you used something similar. I am running an experiment right now and I am getting a ROUGE-2 score of 0.143-0.144, so it seems like this was either not changed or it has not made a difference.
Thanks!
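For reference, the Llama-2 chat format from the linked blog post wraps the system prompt in `<<SYS>>` tags inside an `[INST]` block. A minimal formatter following that template (a sketch of the format being asked about, not the repository's `clip_input`; the example prompts are hypothetical) looks like:

```python
def format_llama2_chat(system_prompt: str, user_message: str) -> str:
    """Wrap a system prompt and a single user turn in the Llama-2 chat
    template described at
    https://huggingface.co/blog/llama2#how-to-prompt-llama-2."""
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


# Hypothetical system prompt and user message, for illustration only.
prompt = format_llama2_chat(
    "You are a helpful assistant that summarizes news articles.",
    "Summarize the following article: ...",
)
print(prompt)
```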
Hi, thank you for sharing the code of your interesting research.
I have a question about how to adapt the Bayesian optimization method for the HumanEval task. It seems to have only a test set of 164 samples, so did you take 4-8 samples from this dataset and then test on the remaining ones, or did you use a different dataset for the optimization?
I was also wondering how many samples you used. If it is possible to share this code, that would be great; if not, answering these questions would still be very helpful!
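One plausible version of the split being asked about can be sketched as follows: hold out a handful of HumanEval's 164 problems as a development set for the Bayesian optimization and evaluate on the rest. This is an assumption about the setup, not the authors' confirmed procedure, and the dev-set size of 8 is just an example.

```python
import random


def split_humaneval(n_total=164, n_dev=8, seed=0):
    """Hold out `n_dev` problem indices for tuning the draft configuration
    and keep the remaining indices for evaluation."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    return indices[:n_dev], indices[n_dev:]


dev_idx, test_idx = split_humaneval()
print(len(dev_idx), len(test_idx))  # 8 156
```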