Motivation
This PR adds support for speculative decoding for llama and gpt_bigcode models.
Modifications
It introduces a new model type and a new batch type, following the same pattern as the Flash models. The speculator and the KV cache manager are imported from the fms_extras package.
Result
tbd
Related Issues
tbd