ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: Support for Meta: Multi Token Prediction Models #8297

Closed: sorasoras closed this issue 1 week ago

sorasoras commented 2 months ago

Feature Description

Meta: Multi Token Prediction Models (https://arxiv.org/abs/2404.19737)

Concept and architecture: multi-token prediction trains LLMs to predict multiple future tokens simultaneously, rather than just the next token. The model architecture is Transformer-based, but with multiple independent output heads, one for each future token it aims to predict.
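The head layout described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: a shared trunk produces one hidden state per position, and n independent linear heads each predict a different future offset. All names and sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_heads = 16, 32, 4

# Shared trunk output: one hidden state for position t.
trunk_out = rng.standard_normal((1, d_model))

# n independent output heads; head k predicts the token at offset t+1+k.
heads = [rng.standard_normal((d_model, vocab)) for _ in range(n_heads)]

# Each head maps the same trunk output to its own vocabulary logits.
logits = [trunk_out @ W for W in heads]            # n_heads arrays of shape (1, vocab)
predictions = [int(np.argmax(l)) for l in logits]  # one predicted token per future offset
print(len(predictions))  # 4
```

The key point is that the heads are independent and share the trunk, so predicting n tokens costs one trunk forward pass plus n cheap linear projections.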

Models: https://huggingface.co/facebook/multi-token-prediction

Speed: models trained with this approach can be up to 3 times faster at inference time across various batch sizes.

Motivation

Support inference for this new architecture.

Possible Implementation

Implementation details: the PyTorch state dictionaries are compatible with the Llama format. The additional prediction heads for future tokens are stored under keys named "extra_heads" and can be ignored for standard autoregressive inference.
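Since the checkpoints follow the Llama layout and only the "extra_heads" tensors are additional, a converter or loader that wants plain next-token inference can simply drop those keys. A hedged sketch with a plain dict standing in for the state dict; the other key names are illustrative, only the "extra_heads" prefix comes from the issue:

```python
# Stand-in for a loaded PyTorch state dict (values would be tensors).
state_dict = {
    "tok_embeddings.weight": "...",
    "layers.0.attention.wq.weight": "...",
    "output.weight": "...",         # the standard next-token head
    "extra_heads.0.weight": "...",  # additional head for a further future token
    "extra_heads.1.weight": "...",
}

# Keep only the tensors a standard autoregressive Llama loader expects.
llama_compatible = {k: v for k, v in state_dict.items()
                    if not k.startswith("extra_heads")}
print(sorted(llama_compatible))
```

With `torch`, the same effect could be had by filtering before `load_state_dict`, or by loading with `strict=False` so the unexpected keys are reported rather than fatal.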

foldl commented 1 month ago

I implemented this in chatllm.cpp. It is more than 2x faster on CPU, but quality degrades too.

https://github.com/foldl/chatllm.cpp/blob/master/docs/models.md#special-models
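The speedup reported above comes from using the extra heads as a draft: they propose several future tokens in one forward pass, and the main next-token head verifies them (self-speculative decoding). A toy illustration of the accept/reject loop; everything here is a stand-in, not chatllm.cpp's actual implementation:

```python
def verify(prefix, drafted, next_token_model):
    """Accept drafted tokens one by one while the main model agrees."""
    accepted = []
    for tok in drafted:
        if next_token_model(prefix + accepted) == tok:
            accepted.append(tok)
        else:
            break  # first mismatch: fall back to the model's own token
    return accepted

# Toy "main model": deterministically emits the next integer in the sequence.
model = lambda seq: (seq[-1] + 1) if seq else 0

prefix = [0, 1, 2]
drafted = [3, 4, 9]  # pretend the extra heads proposed these three tokens
accepted = verify(prefix, drafted, model)
print(accepted)  # [3, 4]
```

Each verification step can commit several tokens at once when the drafts agree, which is where the wall-clock gain comes from; when the extra heads draft poorly (or their training hurts the main head), acceptance drops and quality can degrade, matching the trade-off reported here.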

github-actions[bot] commented 1 week ago

This issue was closed because it has been inactive for 14 days since being marked as stale.