This PR adds a speculative_generate function, which performs speculative generation on the PagedLLaMA model using an MLPSpeculator. The scripts have also been updated to include a speculator_path for users who want to run speculative generation. Lastly, two functions were added to handle batch flattening and expansion, and the attend function was updated to handle inputs that have been flattened.
This PR is the final PR in a stack of PRs related to paged attention + speculative decoding:
Full implementation of the above can be found here: https://github.com/foundation-model-stack/fms-extras/pull/7
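
For reviewers less familiar with speculative decoding, here is a minimal sketch of the greedy propose-and-verify loop that a function like speculative_generate is built around. It is not the code in this PR: toy_base_next and toy_speculate are hypothetical stand-ins for the base model and the speculator, and the base model is queried once per position for readability, whereas a real implementation would typically score all guesses in a single batched forward pass.

```python
from typing import Callable, List, Sequence


def toy_base_next(tokens: Sequence[int]) -> int:
    """Hypothetical base model: greedy next token is (last + 1) mod 10."""
    return (tokens[-1] + 1) % 10


def toy_speculate(tokens: Sequence[int], n: int) -> List[int]:
    """Hypothetical speculator: guesses n tokens ahead.

    It deliberately guesses wrong whenever the correct token would be a
    multiple of 5, so the rejection path is exercised.
    """
    guesses: List[int] = []
    last = tokens[-1]
    for _ in range(n):
        nxt = (last + 1) % 10
        if nxt % 5 == 0:
            nxt = (nxt + 1) % 10  # intentional mistake
        guesses.append(nxt)
        last = nxt
    return guesses


def speculative_step(
    tokens: Sequence[int],
    base_next: Callable[[Sequence[int]], int],
    speculate: Callable[[Sequence[int], int], List[int]],
    n: int,
) -> List[int]:
    """Return the tokens accepted in one propose-and-verify round."""
    guesses = speculate(tokens, n)
    context = list(tokens)
    accepted: List[int] = []
    for guess in guesses:
        target = base_next(context)  # what the base model would emit here
        accepted.append(target)      # the base model's token is always kept
        context.append(target)
        if target != guess:
            break                    # first mismatch: discard later guesses
    else:
        # Every guess matched, so the base model contributes one extra token.
        accepted.append(base_next(context))
    return accepted


def speculative_generate_sketch(
    prompt: Sequence[int], max_new_tokens: int, n: int = 3
) -> List[int]:
    """Generate from a non-empty prompt using repeated speculative steps."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens.extend(speculative_step(tokens, toy_base_next, toy_speculate, n))
    return tokens[: len(prompt) + max_new_tokens]


def greedy_generate(prompt: Sequence[int], max_new_tokens: int) -> List[int]:
    """Plain one-token-at-a-time greedy decoding, used as a reference."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(toy_base_next(tokens))
    return tokens
```

Because every kept token is the one the base model itself would have chosen, the sketch produces the same sequence as plain greedy decoding; the speculator only changes how many tokens are confirmed per round.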