OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

Support Speculative Decoding #1474

Open JOHW85 opened 9 months ago

JOHW85 commented 9 months ago

This could be used for LLMs, and hopefully also for encoder-decoder models, e.g. pairing a smaller NLLB model with a bigger one.

wsxiaoys commented 9 months ago

This looks to be a duplicate of #1234

guillaumekln commented 9 months ago

It's the same idea, but I'm not sure it refers to the same implementation. There is also "speculative sampling", which seems to refer to yet another implementation/algorithm of this concept.

epinnock commented 9 months ago

How hard would it be to implement a really naive version of this with CTranslate2? I would like to pick this up if possible.

guillaumekln commented 9 months ago

Implementing this feature in its most basic form may already be possible with the existing Generator API. You could use generate_batch with a small model, and then use forward_batch with a big model to validate the output. The limitation of this approach is that when the big model disagrees, you have to restart generation from scratch rather than resuming at the first mismatched position.
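The loop described above can be sketched in plain Python. Here `draft` and `big_forward` are hypothetical stand-ins for calling `ctranslate2.Generator.generate_batch` on the small model and `ctranslate2.Generator.forward_batch` on the big model; tokens are plain ints so the control flow is easy to follow, and this is only an illustration of the idea, not CTranslate2 code.

```python
# Minimal sketch of naive greedy speculative decoding.
# `draft` and `big_forward` are hypothetical stand-ins for the small
# model's generate_batch and the big model's forward_batch.

def speculative_generate(draft, big_forward, prompt, max_len, chunk=4):
    """Greedy speculative decoding.

    draft(prefix, n)          -> n proposed next tokens (small model)
    big_forward(prefix, toks) -> the big model's greedy token at each
                                 position of toks, from one forward pass
    """
    out = list(prompt)
    while len(out) - len(prompt) < max_len:
        proposal = draft(out, chunk)           # cheap draft of `chunk` tokens
        expected = big_forward(out, proposal)  # verify all positions at once
        keep = []
        for got, want in zip(proposal, expected):
            if got == want:
                keep.append(got)   # big model agrees: accept the draft token
            else:
                keep.append(want)  # disagreement: take the big model's token
                break              # and draft again from the corrected prefix
        out.extend(keep)
    return out[len(prompt):len(prompt) + max_len]
```

Because a rejected token is replaced by the big model's own choice, the output is identical to greedy decoding with the big model alone. Note that with the current Generator API, each iteration would re-run the big model over the entire prefix, which is the restart-from-scratch cost mentioned above.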