dilab-zju / self-speculative-decoding

Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding**
Apache License 2.0

Proposal: Evaluating Faster, Deterministic Alternatives to Bayesian Optimization for Layer Skipping in Large Models #7

Closed: azurespace closed this issue 10 months ago

azurespace commented 10 months ago

Bayesian optimization can be very time-consuming, especially for larger models. I am interested in exploring how it compares in efficiency with faster, deterministic methods.

Recent work on model quantization and pruning has shown that weights associated with larger average activation magnitudes on the calibration data are more critical. Building on this idea, one might assume that layers that change the activations more (comparing the hidden states before and after the layer) have a larger impact on the model's final output distribution. Could a greedy approach that skips the layers with the lowest importance be effective?
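
A minimal sketch of that greedy, activation-change heuristic, assuming a Hugging Face LLaMA-style layout (`model.model.layers`) and a placeholder checkpoint name; this is not code from this repository, just an illustration of how the ranking could be computed:

```python
# Sketch: rank decoder layers by how much they change the hidden states on a
# small calibration set, then greedily propose the lowest-change layers for
# skipping in the draft pass. Attribute names follow the HF LLaMA layout and
# are assumptions; replace the checkpoint and calibration texts with real ones.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

layers = model.model.layers              # decoder blocks (LLaMA-style layout)
deltas = [0.0] * len(layers)
counts = [0] * len(layers)

def make_hook(idx):
    def hook(module, inputs, output):
        hidden_in = inputs[0]
        hidden_out = output[0] if isinstance(output, tuple) else output
        # Relative change this layer introduces, accumulated over calibration batches.
        rel = (hidden_out - hidden_in).norm() / (hidden_in.norm() + 1e-6)
        deltas[idx] += rel.item()
        counts[idx] += 1
    return hook

handles = [layer.register_forward_hook(make_hook(i)) for i, layer in enumerate(layers)]

calib_texts = ["The quick brown fox jumps over the lazy dog."]  # replace with real data
with torch.no_grad():
    for text in calib_texts:
        batch = tok(text, return_tensors="pt").to(model.device)
        model(**batch)

for h in handles:
    h.remove()

importance = [d / max(c, 1) for d, c in zip(deltas, counts)]
k = 12  # number of layers to skip in the draft model (tuning knob)
skip_layers = sorted(range(len(layers)), key=lambda i: importance[i])[:k]
print("candidate layers to skip in the draft pass:", sorted(skip_layers))
```

The open question would be how to choose `k` (or a per-layer threshold) so that the drafted tokens remain consistent with the full model.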

Additionally, using second-order information such as the Hessian, we could directly estimate which layers, when removed, would have the most significant impact on the final output. This is a well-known technique used in papers such as SparseGPT and GPTQ.
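
For the second-order idea, here is a hedged sketch that uses a diagonal Fisher approximation of the Hessian (OBD-style saliency), in the spirit of, but not identical to, the Hessian-based machinery in SparseGPT/GPTQ; the checkpoint name and the `.layers.` parameter naming are assumptions:

```python
# Sketch: estimate each layer's removal saliency with a diagonal-Fisher proxy for
# the Hessian, i.e. saliency(layer) ~= sum_i 0.5 * F_ii * w_i^2 over the layer's
# parameters (OBD-style). This needs gradients, so use a tiny calibration batch
# or a smaller model in practice; it is an illustration, not the repo's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

num_layers = len(model.model.layers)     # LLaMA-style layout (assumption)
fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if ".layers." in n}

calib_texts = ["The quick brown fox jumps over the lazy dog."]  # replace with real data
for text in calib_texts:
    batch = tok(text, return_tensors="pt")
    model.zero_grad()
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    for n, p in model.named_parameters():
        if n in fisher and p.grad is not None:
            fisher[n] += p.grad.detach() ** 2    # accumulate squared gradients

saliency = [0.0] * num_layers
for n, p in model.named_parameters():
    if n in fisher:
        layer_idx = int(n.split(".layers.")[1].split(".")[0])
        saliency[layer_idx] += 0.5 * (fisher[n] * p.detach() ** 2).sum().item()

ranked = sorted(range(num_layers), key=lambda i: saliency[i])
print("layers ranked from least to most impactful when removed:", ranked)
```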

If we consider layer skipping as a form of model pruning, it may be possible to apply findings from existing pruning algorithms to this setting.

junzhang-zj commented 10 months ago

@azurespace We did try pruning based on the change in activations before and after each layer in our experiments, but it was difficult to determine a skipping threshold that worked consistently across different layers, which resulted in poor prediction consistency. In addition, in follow-up work we are also trying to combine our decoding scheme with Wanda sparsification and LLM.int8() quantization to better serve resource-limited users.