Speculative decoding for `llama` and `gpt_bigcode`

IBM / text-generation-inference

IBM development fork of https://github.com/huggingface/text-generation-inference

Speculative decoding for `llama` and `gpt_bigcode` #79

Closed: tdoublep closed 5 months ago

tdoublep commented 5 months ago

Motivation

This PR adds speculative decoding support for `llama` and `gpt_bigcode` models.

Modifications

It introduces a new model type and a new batch type, following the same pattern used for the Flash models. The speculator and the KV cache manager are imported from the `fms_extras` package.
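
For readers unfamiliar with the technique, the following is a minimal sketch of greedy speculative decoding, not the code in this PR. It does not use `fms_extras` or any TGI class. Every name in it (`base_model`, `draft_model`, `speculative_step`, `generate`) is hypothetical, and the "models" are toy functions that map a token list to a next token. A real implementation would verify all candidates in one batched forward pass and manage the KV cache accordingly; the loop below only illustrates the accept/reject logic.

```python
from typing import Callable, List

# Hypothetical stand-in type: a "model" maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def base_model(tokens: List[int]) -> int:
    """Toy base model (assumption for illustration): counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_model(tokens: List[int]) -> int:
    """Toy speculator: agrees with the base model except right after token 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


def speculative_step(base: NextToken, draft: NextToken,
                     tokens: List[int], k: int) -> List[int]:
    """Run one speculate-then-verify step and return the accepted tokens."""
    # 1. The cheap draft model proposes k candidate tokens autoregressively.
    candidates: List[int] = []
    for _ in range(k):
        candidates.append(draft(tokens + candidates))

    # 2. The base model checks each candidate in order. Shown as a loop for
    #    clarity; in practice this is a single batched forward pass.
    accepted: List[int] = []
    for cand in candidates:
        expected = base(tokens + accepted)
        if cand != expected:
            # First mismatch: keep the base model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(cand)

    # 3. Every candidate matched, so the base model adds one bonus token.
    accepted.append(base(tokens + accepted))
    return accepted


def generate(base: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 3) -> List[int]:
    """Generate tokens by repeating speculative steps (prompt must be non-empty)."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens.extend(speculative_step(base, draft, tokens, k))
    # A step may overshoot the budget, so trim to the requested length.
    return tokens[: len(prompt) + max_new_tokens]


if __name__ == "__main__":
    print(generate(base_model, draft_model, [0], 8))
```

Because a rejected candidate is always replaced by the base model's own token, the output matches plain greedy decoding with the base model; the speculator only changes how many tokens are accepted per step.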

Result

tbd

Related Issues

tbd