Closed cyLi-Tiger closed 2 months ago
Indeed, `self.act` is not being used; it is not there to prevent `hidden_states` from diverging. The input to the draft model should be the features of the base model. Long-term error accumulation might cause `hidden_states` to diverge, but the draft model only needs to guess a few tokens, so this issue shouldn't arise.
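To make the point concrete, here is a minimal sketch of that setup. Everything here is hypothetical (random NumPy matrices stand in for the real one-layer draft model and LM head, and `K` is an assumed draft depth): the draft starts from a base-model feature and only rolls its own hidden state forward a small, fixed number of steps, so error has little room to accumulate.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V, K = 32, 100, 4  # hidden size, vocab size, draft depth (all hypothetical)

# Stand-ins for the real weights: a frozen random "draft layer" and LM head.
W_draft = rng.normal(scale=1.0 / np.sqrt(D), size=(D, D))
W_head = rng.normal(scale=1.0 / np.sqrt(D), size=(D, V))

def draft_tokens(base_feature, k=K):
    """Roll the one-layer draft forward k steps, feeding its own hidden
    state back in. Error can accumulate, but k stays small by design."""
    h, tokens = base_feature, []
    for _ in range(k):
        h = np.tanh(W_draft @ h)            # one draft-layer step
        tokens.append(int(np.argmax(W_head.T @ h)))
    return tokens

base_feature = rng.normal(size=D)  # feature taken from the base model
print(draft_tokens(base_feature))  # at most K guessed tokens
```

After these K guesses the base model verifies them and supplies a fresh feature, so the draft never has to stay stable over a long horizon.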
I notice you have an unused `self.act` here; what is it for? I tried using the draft model (with only one transformer layer) for inference, feeding the `last_hidden` from one round's output as the next round's input, but I found that `hidden_states` sometimes grows larger and eventually produces NaNs in `hidden_states` as the autoregressive process goes on. Did this happen to you? My guess is that `self.act` is meant to keep `hidden_states` from exploding, and that since in speculative decoding the draft model only needs to decode a few tokens, we don't need to worry about such an explosion. Looking forward to your thoughts!
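For what it's worth, the blow-up described in the question is easy to reproduce in a toy setting: repeatedly feeding a map's output back into itself makes the state norm grow geometrically whenever the map's spectral radius exceeds 1. This is purely illustrative (a random linear layer, not the actual draft model, and real layers also have nonlinearities and normalization):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
# A toy "draft layer": a random linear map scaled so its spectral radius
# is above 1, standing in for a layer with no output normalization.
W = rng.normal(scale=1.5 / np.sqrt(d), size=(d, d))

h = rng.normal(size=d)
norms = []
for step in range(50):
    h = W @ h                     # feed the last hidden state back in
    norms.append(np.linalg.norm(h))

print(f"norm after 5 steps:  {norms[4]:.3e}")
print(f"norm after 50 steps: {norms[-1]:.3e}")
```

After a handful of steps the norm is still moderate, which matches the reply: a draft that only guesses a few tokens before the base model takes over again never reaches the regime where the state overflows to NaN.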