It turns out that creating a full copy of the lm_head weights for each Medusa head costs a huge amount of VRAM (especially for multilingual models like Gemma, which have very large vocabularies) and is not necessary at all to get good speculation.
This PR modifies the legacy code path so that new Medusa models are created without duplicating the lm_head, making them much more efficient to run. It also bumps the version number in the config so consumers know how to actually run the model.
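
For context, a minimal sketch of the idea, assuming a PyTorch-style model; the class and attribute names below (`MedusaHead`, `MedusaModel`, `base_lm_head`) are illustrative, not the actual repository code:

```python
# Hypothetical sketch: each Medusa head only transforms hidden states; the base
# model's single lm_head is shared across all heads instead of being copied.
import torch
import torch.nn as nn


class MedusaHead(nn.Module):
    """One speculation head: a small residual MLP, with no lm_head of its own."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.act(self.proj(hidden))


class MedusaModel(nn.Module):
    def __init__(self, hidden_size: int, n_heads: int, base_lm_head: nn.Linear):
        super().__init__()
        self.heads = nn.ModuleList(MedusaHead(hidden_size) for _ in range(n_heads))
        # Shared with the base model: no per-head (hidden_size x vocab_size) copy,
        # which is what used to blow up VRAM for large multilingual vocabularies.
        self.lm_head = base_lm_head

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Speculative logits, shape (n_heads, batch, seq, vocab).
        return torch.stack([self.lm_head(head(hidden)) for head in self.heads])
```

As a rough illustration of the savings: with a ~256k-entry vocabulary and hidden size 4096, each duplicated lm_head is on the order of a billion parameters (about 2 GB in fp16), multiplied by the number of Medusa heads; sharing the base lm_head removes all of that.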