FasterDecoding / Medusa

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
https://sites.google.com/view/medusa-llm
Apache License 2.0

[Retraining] Use Liger Kernel to avoid multi-head logits materialization and scale the context length by N times #119

Open · ByronHsu opened this issue 3 weeks ago

ByronHsu commented 3 weeks ago

https://github.com/linkedin/Liger-Kernel/tree/main/examples/medusa

With FusedLinearCrossEntropy and the other kernels in Liger-Kernel, we can avoid materializing the full logits tensor for each Medusa head, which substantially reduces memory usage while increasing throughput. We would be happy to collaborate and integrate our kernels!
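
For reference, here is a minimal sketch of how the fused loss could replace per-head logits materialization during Medusa retraining. The helper name `medusa_loss_fused`, the use of plain `nn.Linear` heads, and the exact argument order of the Liger-Kernel loss are illustrative assumptions; the linked example above shows the actual integration.

```python
# Sketch only: assumes Liger-Kernel's fused linear + cross-entropy loss,
# which consumes hidden states and a head's projection weight directly
# instead of materializing a (batch, seq_len, vocab_size) logits tensor
# per Medusa head.
import torch
from liger_kernel.transformers import LigerFusedLinearCrossEntropyLoss


def medusa_loss_fused(hidden_states, medusa_heads, labels):
    """
    hidden_states: (batch * seq_len, hidden_size) last-layer activations.
    medusa_heads:  list of per-head output projections (here simplified to
                   nn.Linear with weight of shape vocab_size x hidden_size).
    labels:        (num_heads, batch * seq_len) shifted targets per head.
    """
    # Argument order (weight, input, target) is assumed; check the
    # Liger-Kernel docs for the exact signature.
    loss_fn = LigerFusedLinearCrossEntropyLoss()
    total_loss = 0.0
    for i, head in enumerate(medusa_heads):
        # The fused kernel computes and reduces this head's logits in
        # chunks internally, so the full vocab-sized tensor is never
        # allocated -- this is what frees memory for longer contexts.
        total_loss = total_loss + loss_fn(head.weight, hidden_states, labels[i])
    return total_loss / len(medusa_heads)
```

The per-head loop mirrors the existing Medusa training loss; the only change is that the head projection and cross-entropy are fused, so the intermediate logits for each head are never written out.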


ByronHsu commented 3 weeks ago

cc @ctlllll @leeyeehoo @zhyncs