foundation-model-stack / fms-extras

Apache License 2.0

Speculative Generation e2e #10

Closed JRosenkranz closed 8 months ago

JRosenkranz commented 8 months ago

This PR is the final PR in a stack of PRs related to paged attention + speculative decoding:

Full implementation of the above can be found here: https://github.com/foundation-model-stack/fms-extras/pull/7

This PR adds a speculative_generate function that performs speculative generation on the PagedLLaMA model using an MLPSpeculator. The scripts now accept a speculator_path for users who want to run speculative generation. Finally, two functions were added to handle batch flattening and expansion, and the attend function was updated to handle flattened inputs.
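
For readers new to the technique, here is a minimal sketch of the verification step that speculative generation relies on. It is not the code in this PR: the helper names (count_accepted, verify_candidates) are hypothetical, tokens are plain integers, and greedy decoding is assumed. The idea is that the speculator proposes several candidate continuations, the base model scores them in one forward pass, and we keep the longest prefix the base model agrees with, plus one extra token from the base model itself.

```python
from typing import List, Sequence, Tuple


def count_accepted(candidate: Sequence[int], base_preds: Sequence[int]) -> int:
    """Number of leading candidate tokens that match the base model's greedy picks."""
    accepted = 0
    for guess, truth in zip(candidate, base_preds):
        if guess != truth:
            break
        accepted += 1
    return accepted


def verify_candidates(
    candidates: Sequence[Sequence[int]],
    base_preds: Sequence[Sequence[int]],
) -> Tuple[int, List[int]]:
    """Pick the candidate with the longest accepted prefix (hypothetical helper).

    Assumption: base_preds[i] has len(candidates[i]) + 1 entries, where entry j
    is the base model's greedy token after seeing the prompt plus the first j
    tokens of candidates[i].
    """
    best_idx, best_n = 0, -1
    for idx, (cand, preds) in enumerate(zip(candidates, base_preds)):
        n = count_accepted(cand, preds)
        if n > best_n:
            best_idx, best_n = idx, n
    # Accepted tokens equal the base predictions, so emit those plus one
    # "bonus" token: the base model's prediction at the first mismatch.
    return best_idx, list(base_preds[best_idx][: best_n + 1])


# Two speculated continuations of length 3 and the base model's picks for each.
candidates = [[5, 7, 9], [5, 8, 2]]
base_preds = [[5, 8, 2, 4], [5, 8, 3, 6]]
print(verify_candidates(candidates, base_preds))  # (1, [5, 8, 3])
```

Because every emitted token is one the base model would have produced on its own, the output matches plain greedy decoding under these assumptions; the speedup comes from emitting several tokens per base-model forward pass.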