Question about Heads warmup

FasterDecoding / Medusa

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

Apache License 2.0

2.28k stars 154 forks source link

Hi, I'm not an expert, so this might be a stupid question, but I have a question about the Heads warmup part of the Medusa paper. In that part it is mentioned to train the backbone first with medusa-1 loss in the first stage. When I read the paper referenced in that part(https://arxiv.org/abs/2202.10054), my guess is that it would be better to train the medusa head first. My questions are as follows

why fine-tune the backbone first?
does it really work to train backbone with medusa-1 loss while medusa head is initialized to 0 and frozen, since the output of medusa head would be 0 anyway? why?

FasterDecoding / Medusa

Question about Heads warmup #74