dilab-zju / self-speculative-decoding

Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding**
Apache License 2.0

Proposal: Evaluating Faster, Deterministic Alternatives to Bayesian Optimization for Layer Skipping in Large Models #7

Closed: azurespace closed this issue 10 months ago

azurespace commented 10 months ago

Bayesian optimization can be very time-consuming, especially for larger models. I am interested in exploring how it compares in efficiency with faster, deterministic methods.

Recent work on model quantization and pruning has shown that weights associated with larger average activation magnitudes on the calibration data are more critical. Building on this idea, one might assume that layers that change the activations more (comparing the hidden states before and after the layer) have a larger impact on the model's final output distribution. Could a greedy approach that skips the layers with the lowest importance be effective?
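
A minimal sketch of that greedy, activation-change heuristic, assuming a Hugging Face LLaMA-style layout (`model.model.layers`) and a placeholder checkpoint name; this is not code from this repository, just an illustration of how the ranking could be computed:

```python
# Sketch: rank decoder layers by how much they change the hidden states on a
# small calibration set, then greedily propose the lowest-change layers for
# skipping in the draft pass. Attribute names follow the HF LLaMA layout and
# are assumptions; replace the checkpoint and calibration texts with real ones.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

layers = model.model.layers              # decoder blocks (LLaMA-style layout)
deltas = [0.0] * len(layers)
counts = [0] * len(layers)

def make_hook(idx):
    def hook(module, inputs, output):
        hidden_in = inputs[0]
        hidden_out = output[0] if isinstance(output, tuple) else output
        # Relative change this layer introduces, accumulated over calibration batches.
        rel = (hidden_out - hidden_in).norm() / (hidden_in.norm() + 1e-6)
        deltas[idx] += rel.item()
        counts[idx] += 1
    return hook

handles = [layer.register_forward_hook(make_hook(i)) for i, layer in enumerate(layers)]

calib_texts = ["The quick brown fox jumps over the lazy dog."]  # replace with real data
with torch.no_grad():
    for text in calib_texts:
        batch = tok(text, return_tensors="pt").to(model.device)
        model(**batch)

for h in handles:
    h.remove()

importance = [d / max(c, 1) for d, c in zip(deltas, counts)]
k = 12  # number of layers to skip in the draft model (tuning knob)
skip_layers = sorted(range(len(layers)), key=lambda i: importance[i])[:k]
print("candidate layers to skip in the draft pass:", sorted(skip_layers))
```

The open question would be how to choose `k` (or a per-layer threshold) so that the drafted tokens remain consistent with the full model.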

Additionally, using second-order information such as the Hessian, we could directly estimate which layers, when removed, would have the most significant impact on the final output. This is a well-known technique used in papers such as SparseGPT and GPTQ.
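
For the second-order idea, here is a hedged sketch that uses a diagonal Fisher approximation of the Hessian (OBD-style saliency), in the spirit of, but not identical to, the Hessian-based machinery in SparseGPT/GPTQ; the checkpoint name and the `.layers.` parameter naming are assumptions:

```python
# Sketch: estimate each layer's removal saliency with a diagonal-Fisher proxy for
# the Hessian, i.e. saliency(layer) ~= sum_i 0.5 * F_ii * w_i^2 over the layer's
# parameters (OBD-style). This needs gradients, so use a tiny calibration batch
# or a smaller model in practice; it is an illustration, not the repo's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

num_layers = len(model.model.layers)     # LLaMA-style layout (assumption)
fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if ".layers." in n}

calib_texts = ["The quick brown fox jumps over the lazy dog."]  # replace with real data
for text in calib_texts:
    batch = tok(text, return_tensors="pt")
    model.zero_grad()
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    for n, p in model.named_parameters():
        if n in fisher and p.grad is not None:
            fisher[n] += p.grad.detach() ** 2    # accumulate squared gradients

saliency = [0.0] * num_layers
for n, p in model.named_parameters():
    if n in fisher:
        layer_idx = int(n.split(".layers.")[1].split(".")[0])
        saliency[layer_idx] += 0.5 * (fisher[n] * p.detach() ** 2).sum().item()

ranked = sorted(range(num_layers), key=lambda i: saliency[i])
print("layers ranked from least to most impactful when removed:", ranked)
```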

If we consider layer skipping as a form of model pruning, it may be possible to apply findings from existing pruning algorithms to this setting.

junzhang-zj commented 10 months ago

@azurespace We did try pruning based on the change in activations before and after each layer in our experiments, but it was difficult to determine a skipping threshold that worked consistently across different layers, which resulted in poor prediction consistency. In addition, in follow-up work we are also trying to combine our decoding scheme with Wanda sparsification and LLM.int8() quantization to better serve resource-limited users.