Closed: KerfuffleV2 closed this issue 1 year ago.
Of course. You can use attention layers [3, 5, 6, 8, 10, 11, 14, 15, 18, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37] and MLP layers [6, 9, 10, 11, 15, 24, 25, 27, 28, 35] as a starting point to speed up the search, but the search results will depend to some extent on your environment and hardware.
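For readers who want to try these indices directly, here is a minimal sketch of how they might be applied to a Hugging Face LLaMA-2-13B checkpoint. This is not the repository's code: the checkpoint name is an assumption, and "skipping" is implemented here as zeroing a sub-layer's output so that only the residual stream is carried forward. The attention stand-in returns the 3-tuple that older `transformers` `LlamaAttention` versions return, which may need adjusting for newer versions.

```python
# Minimal sketch, not the authors' implementation. Assumes a Hugging Face
# LlamaForCausalLM checkpoint ("meta-llama/Llama-2-13b-hf" is an assumed name)
# and treats "skipping" a sub-layer as zeroing its contribution, so the decoder
# layer's residual add leaves the hidden states unchanged.
import torch
from transformers import AutoModelForCausalLM

# Layer indices quoted above (13B LLaMA 2 has 40 decoder layers, indexed 0..39).
ATTN_SKIP = {3, 5, 6, 8, 10, 11, 14, 15, 18, 22, 23, 24, 25, 26, 27, 28,
             29, 30, 31, 33, 34, 35, 36, 37}
MLP_SKIP = {6, 9, 10, 11, 15, 24, 25, 27, 28, 35}


class SkipAttention(torch.nn.Module):
    """Stand-in for a skipped self-attention sub-layer: contributes nothing."""
    def forward(self, hidden_states, **kwargs):
        # Zero output means residual + 0 == residual, i.e. the sub-layer is bypassed.
        # Return shape mirrors older LlamaAttention: (output, attn_weights, past_key_value).
        return torch.zeros_like(hidden_states), None, kwargs.get("past_key_value")


class SkipMLP(torch.nn.Module):
    """Stand-in for a skipped MLP sub-layer: contributes nothing."""
    def forward(self, hidden_states):
        return torch.zeros_like(hidden_states)


model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf", torch_dtype=torch.float16, device_map="auto"
)

for i, layer in enumerate(model.model.layers):
    if i in ATTN_SKIP:
        layer.self_attn = SkipAttention()
    if i in MLP_SKIP:
        layer.mlp = SkipMLP()
```

After patching, the model can be evaluated as usual (e.g., perplexity on a held-out set) to check whether this particular combination is acceptable on your own hardware and data, since, as noted above, the best set of layers can vary with the environment and device.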
Thank you very much! I also appreciate the fast response.
(You can close this issue or leave it open if you think other people might want to see that information.)
@junzhang-zj Thanks for making it available. Would you be willing to post examples of sets of layers you found to be optimal for skipping in llama-7b as well?
@gxy-gxy llama-7b? We have not tested that model. Generally, models above 13B show obvious redundancy.
@gxy-gxy I guess a subset of the llama-7b model is insufficient to make good predictions even for easy tokens.
Dear authors,
Your work looks very interesting! Thanks for making it available. Would you be willing to post examples of sets of layers you found to be optimal for skipping? The paper only shows performance based on the number of layers skipped, but it notes that the specific combination of layers skipped is important.
Even though the experiments were performed on 13B LLaMA 2, it would be interesting to see whether the results can be extrapolated to other models/sizes, even if only as a starting point for finding more optimal configurations.
Thanks for your time!