dilab-zju / self-speculative-decoding

Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding**
Apache License 2.0

The experimental results are inconsistent. #20

Open HuYunhai-Alex opened 1 month ago

HuYunhai-Alex commented 1 month ago

Using the skip-layer configuration provided by the project to run CodeLlama2-13B and LLaMA2-13B-Chat, the speculative decoding time in evaluate_sum and evaluate_code is significantly longer than for the base model. Could you please explain why this might be the case?

*(screenshot of benchmark results attached)*
HimanshuJanbandhu commented 1 month ago

Can you elaborate on what system you are running this on? As far as I can see, the matchness (acceptance rate) is quite high, so this problem shouldn't occur.

junzhang-zj commented 1 month ago

Yes, the acceptance rate is normal, so you should be seeing acceleration. Can you rule out a problem with the sss mode by trying essg instead? In addition, you can update the environment and re-search the skipped layers.
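For readers puzzled by how a high acceptance rate can coexist with a slowdown: the net speedup also depends on how cheap the draft (skip-layer) forward pass is relative to the full model. Below is a minimal, hypothetical cost model (not code from this repository; the function name, the geometric-acceptance assumption, and the unit of "one base forward pass" are all illustrative) showing that if too few layers are skipped, the draft pass is nearly as expensive as the base pass and speculative decoding can be slower despite high matchness.

```python
def expected_speedup(accept_rate: float, draft_cost_ratio: float, k: int) -> float:
    """Toy estimate of self-speculative decoding speedup vs. plain decoding.

    accept_rate:      per-token probability a drafted token is accepted
                      (acceptance assumed independent per position).
    draft_cost_ratio: cost of one draft forward pass relative to one
                      base-model forward pass (lower = more layers skipped).
    k:                number of tokens drafted per draft-verify cycle.
    """
    # Expected tokens emitted per cycle: the run of accepted draft tokens
    # (geometric model) plus the one token the verification pass always yields.
    expected_tokens = sum(accept_rate ** i for i in range(1, k + 1)) + 1
    # Cycle cost in units of base forward passes: k draft passes plus
    # one (parallel) verification pass over the drafted tokens.
    cycle_cost = k * draft_cost_ratio + 1
    # Plain decoding emits 1 token per unit cost, so this ratio is the speedup.
    return expected_tokens / cycle_cost


if __name__ == "__main__":
    # Cheap draft (many layers skipped): high acceptance translates to a speedup.
    print(expected_speedup(accept_rate=0.9, draft_cost_ratio=0.5, k=5))
    # Expensive draft (too few layers skipped): same acceptance, net slowdown.
    print(expected_speedup(accept_rate=0.9, draft_cost_ratio=0.9, k=5))
```

Under this toy model, the same 0.9 acceptance rate gives a speedup above 1 when the draft pass costs half a base pass, but drops below 1 when it costs 0.9 of a base pass, which is consistent with the suggestion above to re-search the skipped layers.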