dilab-zju / self-speculative-decoding

Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding**
Apache License 2.0

Data on optimal layers to skip? #2

Closed · KerfuffleV2 closed this issue 9 months ago

KerfuffleV2 commented 9 months ago

Dear authors,

Your work looks very interesting! Thanks for making it available. Would you be willing to post examples of sets of layers you found to be optimal for skipping? The paper only shows performance based on the number of layers skipped, but it notes that the specific combination of layers skipped is important.

Even though the experiments were performed on a 13B LLaMA 2, it would be interesting to see whether the results extrapolate to other models/sizes, even if only as a starting point for finding better configurations.

Thanks for your time!

junzhang-zj commented 9 months ago

Of course. You can use attention layers [3, 5, 6, 8, 10, 11, 14, 15, 18, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37] and MLP layers [6, 9, 10, 11, 15, 24, 25, 27, 28, 35] as a starting point to speed up the search, but the search results will depend somewhat on your environment and hardware.
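For anyone wiring these lists into their own code, here is a minimal sketch of how the two skip sets could gate a draft-model forward pass over the 40 decoder layers of LLaMA-2-13B. The `draft_forward` helper and the layer attribute names (`self_attn`, `mlp`, `input_layernorm`, `post_attention_layernorm`) are illustrative assumptions, not this repo's actual API.

```python
# Suggested sub-layers to bypass in the draft pass (0-indexed decoder layers).
ATTN_SKIP = [3, 5, 6, 8, 10, 11, 14, 15, 18, 22, 23, 24, 25, 26, 27, 28,
             29, 30, 31, 33, 34, 35, 36, 37]
MLP_SKIP = [6, 9, 10, 11, 15, 24, 25, 27, 28, 35]


def draft_forward(layers, hidden_states, attn_skip=ATTN_SKIP, mlp_skip=MLP_SKIP):
    """Run the decoder stack, skipping the selected attention/MLP sub-layers.

    `layers` is assumed to be a list of decoder blocks exposing
    `input_layernorm`, `self_attn`, `post_attention_layernorm`, and `mlp`
    callables; residual connections are applied here explicitly.
    """
    attn_skip, mlp_skip = set(attn_skip), set(mlp_skip)
    for i, layer in enumerate(layers):
        if i not in attn_skip:
            # Pre-norm attention sub-layer with residual connection.
            hidden_states = hidden_states + layer.self_attn(
                layer.input_layernorm(hidden_states))
        if i not in mlp_skip:
            # Pre-norm MLP sub-layer with residual connection.
            hidden_states = hidden_states + layer.mlp(
                layer.post_attention_layernorm(hidden_states))
    return hidden_states
```

The verification pass would still run every layer; only the draft pass uses the reduced stack, which is what keeps the decoding lossless.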

KerfuffleV2 commented 9 months ago

Thank you very much! I also appreciate the fast response.

(You can close this issue or leave it open if you think other people might want to see that information.)

gxy-gxy commented 9 months ago

@junzhang-zj Thanks for making it available. Would you be willing to post the sets of layers you found optimal to skip for LLaMA-7B as well?

junzhang-zj commented 9 months ago

@gxy-gxy LLaMA-7B? We have not tested that model. Generally, models above 13B show obvious redundancy.

w32zhong commented 5 months ago

@gxy-gxy I suspect a layer subset of LLaMA-7B is insufficient to make good predictions, even for easy tokens.