FasterDecoding / Medusa

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
https://sites.google.com/view/medusa-llm
Apache License 2.0

Creating medusa2. #97

Closed · Narsil closed this 6 months ago

Narsil commented 6 months ago

Turns out creating a full copy of the lm_head weights for each Medusa head costs a huge amount of VRAM (especially for multilingual models with large vocabularies, like Gemma) and is not necessary at all to get good speculation.
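For a sense of where the memory goes, here is a minimal sketch of the two layouts (class names like `LegacyMedusaHead` and `SharedMedusaHead` are illustrative, not the repo's actual classes; the sizes in the comments assume a Gemma-scale vocabulary):

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Simplified residual block of the kind each Medusa head adds."""
    def __init__(self, hidden_size):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()

    def forward(self, x):
        return x + self.act(self.linear(x))

class LegacyMedusaHead(nn.Module):
    """Legacy layout: every head carries its own lm_head copy.
    With a ~256k vocabulary and hidden size 4096, each copy is roughly
    256_000 * 4096 * 2 bytes ~= 2 GB in fp16 -- per head."""
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.block = ResBlock(hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden):
        return self.lm_head(self.block(hidden))

class SharedMedusaHead(nn.Module):
    """medusa2-style layout: only the small residual block is new;
    the logits reuse the base model's existing lm_head."""
    def __init__(self, hidden_size):
        super().__init__()
        self.block = ResBlock(hidden_size)

    def forward(self, hidden, base_lm_head):
        return base_lm_head(self.block(hidden))
```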

This PR modifies the legacy code so that new medusa models are created without duplicating the lm_head, making them much more efficient to run. It also increments the version number in the config so that loaders can tell how to actually run the model.
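A loader can then branch on that version field; a minimal sketch, assuming a JSON config with a `version` key (the exact key name and value are assumptions here, not the PR's actual schema):

```python
import json

def load_medusa_config(path):
    """Decide how to run a checkpoint from its config version.
    The `version` key and the value 2 are illustrative assumptions."""
    with open(path) as f:
        config = json.load(f)
    # Legacy checkpoints predate the field, so default to version 1.
    version = config.get("version", 1)
    if version >= 2:
        # medusa2: checkpoint holds only the residual blocks; the final
        # projection reuses the base model's lm_head.
        share_lm_head = True
    else:
        # Legacy: checkpoint bundles a full lm_head copy per head.
        share_lm_head = False
    return config, share_lm_head
```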

tim-a-davis commented 6 months ago

Good catch @Narsil. This will probably lead to more stable training as well.