Closed · qiyuangong closed this issue 8 months ago
Thank you for your interest in our work! The implementation is built on top of `transformers`. We envision this repository more as a tool for studying speculative decoding than as a ready-to-deploy framework, but we are open to contributions from anyone interested in re-implementing, testing, or applying this concept across a wider range of models. The custom modeling code primarily allows intermediate layers to be skipped, which makes it relatively straightforward to adapt to other models.

`hidden_states.requires_grad_(True)` is not necessary for inference. Thank you for your report, I have updated the code @qiyuangong.
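To illustrate what the custom modeling code enables: skipping intermediate layers during drafting reduces to a conditional inside the decoder's layer loop. This is only a minimal sketch, not the code in this repo, and `skip_layer_ids` is a made-up argument name:

```python
import torch
from torch import nn

class SkippableDecoder(nn.Module):
    """Toy decoder stack whose forward pass can bypass selected layers."""

    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers

    def forward(self, hidden_states: torch.Tensor, skip_layer_ids=frozenset()):
        for i, layer in enumerate(self.layers):
            if i in skip_layer_ids:
                continue  # draft pass: bypass this intermediate layer entirely
            hidden_states = layer(hidden_states)
        return hidden_states
```

A verification pass is then simply a forward call with an empty skip set, so the draft and verify passes share all weights.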
You are welcome! :)
Nice repo and paper!
The self-speculative decoding implementation is quite clear and straightforward compared to Hugging Face's assistant-model implementation. :)
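For anyone else reading along, here is how I understand the core greedy draft-then-verify loop. This is a sketch under my own assumptions, not the repo's actual API: the `skip_layers` keyword, the `model` callable returning raw logits, and batch size 1 are all hypothetical, and KV caching is omitted for brevity.

```python
import torch

@torch.no_grad()
def self_speculative_decode(model, input_ids, skip_layers, draft_len=4, max_len=128):
    """Greedy self-speculative decoding: the same model drafts with some
    layers skipped, then a single full forward pass verifies the draft.
    Assumes batch size 1 and a model that returns logits of shape (B, T, V)."""
    while input_ids.shape[1] < max_len:
        # 1) Draft: cheap autoregressive pass with intermediate layers skipped.
        draft = input_ids
        for _ in range(draft_len):
            logits = model(draft, skip_layers=skip_layers)  # hypothetical flag
            draft = torch.cat([draft, logits[:, -1:].argmax(-1)], dim=1)
        # 2) Verify: one full pass scores all drafted positions in parallel.
        full_logits = model(draft[:, :-1], skip_layers=frozenset())
        verified = full_logits[:, input_ids.shape[1] - 1:].argmax(-1)
        drafted = draft[:, input_ids.shape[1]:]
        # 3) Accept the longest prefix on which draft and full model agree,
        #    then take the full model's token at the first disagreement.
        n_ok = (verified == drafted).long().cumprod(-1).sum().item()
        input_ids = torch.cat(
            [input_ids, drafted[:, :n_ok], verified[:, n_ok:n_ok + 1]], dim=1
        )
    return input_ids
```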
I ran some tests on 1K-token inputs, and the results look promising.
But I have a few questions about `modeling_llama.py`:

- `bitfit_linear_forward` is not used.
- `hidden_states.requires_grad_(True)` seems unnecessary for inference.
- `draft_attn_skip_mask` and `draft_mlp_skip_mask` are not used (see my guess at their intended use below).
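On the last point, the names suggest finer granularity than whole-layer skipping: separate masks so the draft pass can bypass just the attention or just the MLP sublayer of each layer. Something like this is what I would expect them to feed into. This is purely a guess on my part; only the two mask names come from `modeling_llama.py`, and everything else here is invented for illustration:

```python
from torch import nn

class MaskableDecoderLayer(nn.Module):
    """Decoder layer whose attention and MLP sublayers can be skipped independently."""

    def __init__(self, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.attn, self.mlp = attn, mlp

    def forward(self, x, skip_attn=False, skip_mlp=False):
        if not skip_attn:
            x = x + self.attn(x)  # skipping leaves x on the residual path unchanged
        if not skip_mlp:
            x = x + self.mlp(x)
        return x

def draft_forward(layers, x, draft_attn_skip_mask, draft_mlp_skip_mask):
    """Draft pass where mask[i] == True means 'skip that sublayer in layer i'."""
    for i, layer in enumerate(layers):
        x = layer(x,
                  skip_attn=bool(draft_attn_skip_mask[i]),
                  skip_mlp=bool(draft_mlp_skip_mask[i]))
    return x
```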