Closed · qiyuangong closed this issue 8 months ago
Thank you for your interest in our work! The implementation is built on top of `transformers`. We envision this repository more as a tool for studying speculative decoding than as a ready-to-deploy framework, but we are open to contributions from anyone interested in re-implementing, testing, or applying this concept across a wider range of models. The custom modeling code primarily allows intermediate layers to be skipped, which makes it relatively straightforward to adapt to other models.

`hidden_states.requires_grad_(True)` is not necessary for inference. Thank you for your report, I have updated the code @qiyuangong.
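To illustrate what the custom modeling code enables: skipping intermediate layers during drafting reduces to a conditional inside the decoder's layer loop. This is only a minimal sketch, not the code in this repo, and `skip_layer_ids` is a made-up argument name:

```python
import torch
from torch import nn

class SkippableDecoder(nn.Module):
    """Toy decoder stack whose forward pass can bypass selected layers."""

    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers

    def forward(self, hidden_states: torch.Tensor, skip_layer_ids=frozenset()):
        for i, layer in enumerate(self.layers):
            if i in skip_layer_ids:
                continue  # draft pass: bypass this intermediate layer entirely
            hidden_states = layer(hidden_states)
        return hidden_states
```

A verification pass is then simply a forward call with an empty skip set, so the draft and verify passes share all weights.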
You are welcome! :)
Nice repo and paper!
The self-speculative decoding implementation is quite clear and straightforward compared to Hugging Face's assistant-model implementation. :)
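For anyone else reading along, here is how I understand the core greedy draft-then-verify loop. This is a sketch under my own assumptions, not the repo's actual API: the `skip_layers` keyword, the `model` callable returning raw logits, and batch size 1 are all hypothetical, and KV caching is omitted for brevity.

```python
import torch

@torch.no_grad()
def self_speculative_decode(model, input_ids, skip_layers, draft_len=4, max_len=128):
    """Greedy self-speculative decoding: the same model drafts with some
    layers skipped, then a single full forward pass verifies the draft.
    Assumes batch size 1 and a model that returns logits of shape (B, T, V)."""
    while input_ids.shape[1] < max_len:
        # 1) Draft: cheap autoregressive pass with intermediate layers skipped.
        draft = input_ids
        for _ in range(draft_len):
            logits = model(draft, skip_layers=skip_layers)  # hypothetical flag
            draft = torch.cat([draft, logits[:, -1:].argmax(-1)], dim=1)
        # 2) Verify: one full pass scores all drafted positions in parallel.
        full_logits = model(draft[:, :-1], skip_layers=frozenset())
        verified = full_logits[:, input_ids.shape[1] - 1:].argmax(-1)
        drafted = draft[:, input_ids.shape[1]:]
        # 3) Accept the longest prefix on which draft and full model agree,
        #    then take the full model's token at the first disagreement.
        n_ok = (verified == drafted).long().cumprod(-1).sum().item()
        input_ids = torch.cat(
            [input_ids, drafted[:, :n_ok], verified[:, n_ok:n_ok + 1]], dim=1
        )
    return input_ids
```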
I ran some tests on 1K-token inputs, and the results look promising.
But I have a few questions about `modeling_llama.py`:

- `bitfit_linear_forward` is not used.
- `hidden_states.requires_grad_(True)` seems unnecessary for inference.
- `draft_attn_skip_mask` and `draft_mlp_skip_mask` are not used (see my guess at their intended use below).
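On the last point, the names suggest finer granularity than whole-layer skipping: separate masks so the draft pass can bypass just the attention or just the MLP sublayer of each layer. Something like this is what I would expect them to feed into. This is purely a guess on my part; only the two mask names come from `modeling_llama.py`, and everything else here is invented for illustration:

```python
from torch import nn

class MaskableDecoderLayer(nn.Module):
    """Decoder layer whose attention and MLP sublayers can be skipped independently."""

    def __init__(self, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.attn, self.mlp = attn, mlp

    def forward(self, x, skip_attn=False, skip_mlp=False):
        if not skip_attn:
            x = x + self.attn(x)  # skipping leaves x on the residual path unchanged
        if not skip_mlp:
            x = x + self.mlp(x)
        return x

def draft_forward(layers, x, draft_attn_skip_mask, draft_mlp_skip_mask):
    """Draft pass where mask[i] == True means 'skip that sublayer in layer i'."""
    for i, layer in enumerate(layers):
        x = layer(x,
                  skip_attn=bool(draft_attn_skip_mask[i]),
                  skip_mlp=bool(draft_mlp_skip_mask[i]))
    return x
```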