Closed: MichaelMtl closed this issue 1 month ago
@MichaelMtl I apologize for the delay in responding. Thank you for supporting our work. For the CodeLlama test, we selected 4 to 8 Python-related samples from the StarCoder data to build the draft model.
OK, thank you for clarifying this!
I was also wondering if you changed the prompt for the LLaMA-2-13B-Chat experiments or just kept it the same as for the LLaMA-2-13B experiments.
@MichaelMtl Do you mean the prompt for the development set of the draft model? If so: for LLaMA-13B, LLaMA-13B-Chat, and LLaMA-70B, since they evaluate the same tasks, their development sets are identical, consisting of 8 items in total: 4 from CNN/DM and 4 from XSum.
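For illustration, a development-set construction like the one described (4 items each from CNN/DM and XSum, 8 in total) might be sketched as below. This is only a sketch: the corpus variables are hypothetical placeholders, not the repository's actual data-loading code.

```python
import random


def build_dev_set(cnn_dm_samples, xsum_samples, per_dataset=4, seed=0):
    """Draw a small development set: `per_dataset` items from each corpus."""
    rng = random.Random(seed)
    return rng.sample(cnn_dm_samples, per_dataset) + rng.sample(xsum_samples, per_dataset)


# Hypothetical placeholder corpora standing in for CNN/DM and XSum articles.
cnn_dm = [f"cnn_article_{i}" for i in range(100)]
xsum = [f"xsum_article_{i}" for i in range(100)]

dev_set = build_dev_set(cnn_dm, xsum)
print(len(dev_set))  # 8
```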
Hi, sorry, I meant the system prompt, in particular what is done in the function `clip_input`. I believe the chat models were fine-tuned with a specific prompt (https://huggingface.co/blog/llama2#how-to-prompt-llama-2), so I was wondering if you used something similar. I am running an experiment right now and I am getting a ROUGE-2 score of 0.143-0.144, so it seems like this was either not changed or it has not made a difference.
Thanks!
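For reference, the Llama-2 chat format from the linked blog post wraps the system prompt in `<<SYS>>` tags inside an `[INST]` block. A minimal formatter following that template (a sketch of the format being asked about, not the repository's `clip_input`; the example prompts are hypothetical) looks like:

```python
def format_llama2_chat(system_prompt: str, user_message: str) -> str:
    """Wrap a system prompt and a single user turn in the Llama-2 chat
    template described at
    https://huggingface.co/blog/llama2#how-to-prompt-llama-2."""
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


# Hypothetical system prompt and user message, for illustration only.
prompt = format_llama2_chat(
    "You are a helpful assistant that summarizes news articles.",
    "Summarize the following article: ...",
)
print(prompt)
```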
Hi, thank you for sharing the code of your interesting research.
I have a question about how to adapt the Bayesian optimization method for the HumanEval task. It seems to have only a test set of 164 samples, so did you take 4-8 samples from this dataset and then test on the remaining ones, or did you use a different dataset for the optimization?
I was also wondering how many samples you used. If it is possible to share this code, that would be great; if not, answering these questions would still be very helpful!
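One plausible version of the split being asked about can be sketched as follows: hold out a handful of HumanEval's 164 problems as a development set for the Bayesian optimization and evaluate on the rest. This is an assumption about the setup, not the authors' confirmed procedure, and the dev-set size of 8 is just an example.

```python
import random


def split_humaneval(n_total=164, n_dev=8, seed=0):
    """Hold out `n_dev` problem indices for tuning the draft configuration
    and keep the remaining indices for evaluation."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    return indices[:n_dev], indices[n_dev:]


dev_idx, test_idx = split_humaneval()
print(len(dev_idx), len(test_idx))  # 8 156
```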