ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: Support for Meta: Multi Token Prediction Models #8297

Closed: sorasoras closed this issue 1 week ago

sorasoras commented 2 months ago

Feature Description

Meta: Multi Token Prediction Models (https://arxiv.org/abs/2404.19737)

Concept and architecture: multi-token prediction trains LLMs to predict multiple future tokens simultaneously, rather than just the next token. The model architecture is Transformer-based, but with multiple independent output heads, one for each future token it aims to predict.
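The head layout described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: a shared trunk produces one hidden state per position, and n independent linear heads each predict a different future offset. All names and sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_heads = 16, 32, 4

# Shared trunk output: one hidden state for position t.
trunk_out = rng.standard_normal((1, d_model))

# n independent output heads; head k predicts the token at offset t+1+k.
heads = [rng.standard_normal((d_model, vocab)) for _ in range(n_heads)]

# Each head maps the same trunk output to its own vocabulary logits.
logits = [trunk_out @ W for W in heads]            # n_heads arrays of shape (1, vocab)
predictions = [int(np.argmax(l)) for l in logits]  # one predicted token per future offset
print(len(predictions))  # 4
```

The key point is that the heads are independent and share the trunk, so predicting n tokens costs one trunk forward pass plus n cheap linear projections.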

Models: https://huggingface.co/facebook/multi-token-prediction

Speed: models trained with this approach can be up to 3 times faster at inference time across various batch sizes.

Motivation

Support inference for this new architecture.

Possible Implementation

Implementation details: the PyTorch state dictionaries are compatible with the Llama format. The additional prediction heads for future tokens are stored under keys named "extra_heads" and can be ignored for standard autoregressive inference.
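Since the checkpoints follow the Llama layout and only the "extra_heads" tensors are additional, a converter or loader that wants plain next-token inference can simply drop those keys. A hedged sketch with a plain dict standing in for the state dict; the other key names are illustrative, only the "extra_heads" prefix comes from the issue:

```python
# Stand-in for a loaded PyTorch state dict (values would be tensors).
state_dict = {
    "tok_embeddings.weight": "...",
    "layers.0.attention.wq.weight": "...",
    "output.weight": "...",         # the standard next-token head
    "extra_heads.0.weight": "...",  # additional head for a further future token
    "extra_heads.1.weight": "...",
}

# Keep only the tensors a standard autoregressive Llama loader expects.
llama_compatible = {k: v for k, v in state_dict.items()
                    if not k.startswith("extra_heads")}
print(sorted(llama_compatible))
```

With `torch`, the same effect could be had by filtering before `load_state_dict`, or by loading with `strict=False` so the unexpected keys are reported rather than fatal.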

foldl commented 1 month ago

I implemented this in chatllm.cpp. It is more than 2x faster on CPU, but quality degrades too.

https://github.com/foldl/chatllm.cpp/blob/master/docs/models.md#special-models
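The speedup reported above comes from using the extra heads as a draft: they propose several future tokens in one forward pass, and the main next-token head verifies them (self-speculative decoding). A toy illustration of the accept/reject loop; everything here is a stand-in, not chatllm.cpp's actual implementation:

```python
def verify(prefix, drafted, next_token_model):
    """Accept drafted tokens one by one while the main model agrees."""
    accepted = []
    for tok in drafted:
        if next_token_model(prefix + accepted) == tok:
            accepted.append(tok)
        else:
            break  # first mismatch: fall back to the model's own token
    return accepted

# Toy "main model": deterministically emits the next integer in the sequence.
model = lambda seq: (seq[-1] + 1) if seq else 0

prefix = [0, 1, 2]
drafted = [3, 4, 9]  # pretend the extra heads proposed these three tokens
accepted = verify(prefix, drafted, model)
print(accepted)  # [3, 4]
```

Each verification step can commit several tokens at once when the drafts agree, which is where the wall-clock gain comes from; when the extra heads draft poorly (or their training hurts the main head), acceptance drops and quality can degrade, matching the trade-off reported here.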

github-actions[bot] commented 1 week ago

This issue was closed because it has been inactive for 14 days since being marked as stale.