ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Support for RecurrentGemma (Gemma with Griffin Architecture) #6564

Open TechxGenus opened 2 months ago

TechxGenus commented 2 months ago


Feature Description


Google’s newly released RecurrentGemma, a hybrid architecture that combines local attention with a recurrent hidden state: https://huggingface.co/google/recurrentgemma-2b

Motivation


A good, open LLM with a novel architecture.

Possible Implementation


Unlike Jamba (#6372), this model is small enough for most computers to run inference locally. Hybrid architectures are likely to be a trend going forward; I hope llama.cpp can support this one and, if possible, other hybrid architectures as well.
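For context, the recurrent half of that hybrid is Griffin's RG-LRU (Real-Gated Linear Recurrent Unit). Below is a minimal single-channel sketch, assuming the update rule from De et al., 2024; the gate projections are omitted, and nothing here reflects actual llama.cpp code:

```cpp
#include <cmath>

// Minimal single-channel sketch of the RG-LRU recurrence from the Griffin
// paper (De et al., 2024). Illustrative only: the real block applies this
// per channel inside a residual branch, after a short 1D convolution, and
// the gates r and i are sigmoids of learned linear projections of x.
struct rg_lru_channel {
    float lambda;   // learned parameter; the base decay is a = sigmoid(lambda)
    float h = 0.0f; // recurrent state -- fixed size, independent of context length

    // One timestep: x is the input activation, r and i are gate values in (0, 1).
    float step(float x, float r, float i) {
        const float c     = 8.0f;                                   // constant from the paper
        const float log_a = -c * r * std::log1p(std::exp(-lambda)); // log(a^(c*r))
        const float a     = std::exp(log_a);                        // effective decay in (0, 1)
        // h_t = a_t * h_{t-1} + sqrt(1 - a_t^2) * (i_t * x_t)
        h = a * h + std::sqrt(1.0f - a * a) * (i * x);
        return h;
    }
};
```

The property that matters for llama.cpp is the state h: unlike a KV cache, it does not grow with the number of tokens processed.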

vasileermicioi commented 2 months ago

This is the PR merged in gemma.cpp: https://github.com/google/gemma.cpp/pull/136/files

coder543 commented 1 month ago

This issue was marked as stale, but shouldn’t supporting more efficient architectures be a priority?

fat-tire commented 1 month ago

Will this or Griffin be in the upcoming Gemma 2 model(s)? I say "this or Griffin" because the paper mentions a slight difference between RecurrentGemma and Griffin, FWIW:

We make only a single modification to the Griffin architecture (De et al., 2024), which is to multiply the input embeddings by a constant equal to the square root of model width. The input and output embeddings are tied, but this factor is not applied to the output.

Seems like Griffin or a variant could be the "brand new architecture designed for breakthrough performance and efficiency" in today's Gemma 2 announcement, no?
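To make the quoted tweak concrete, here is a minimal sketch of tied embeddings where the sqrt(model-width) factor is applied on the input side only. All names and shapes are hypothetical, not llama.cpp or gemma.cpp APIs:

```cpp
#include <cmath>
#include <vector>

// Sketch of the RecurrentGemma embedding tweak quoted above: input
// embeddings are scaled by sqrt(d_model), but the tied output projection
// uses the raw embedding matrix. All names here are hypothetical.
struct tied_embedding {
    int d_model;
    int n_vocab;
    std::vector<float> weight; // n_vocab x d_model, shared by input and output

    // Input side: lookup, then multiply by sqrt(d_model).
    std::vector<float> embed(int token) const {
        const float scale = std::sqrt((float) d_model);
        std::vector<float> e(weight.begin() + (size_t) token * d_model,
                             weight.begin() + (size_t) (token + 1) * d_model);
        for (float & v : e) v *= scale;
        return e;
    }

    // Output side: logits = W * h, with NO sqrt(d_model) factor applied.
    std::vector<float> logits(const std::vector<float> & h) const {
        std::vector<float> out(n_vocab, 0.0f);
        for (int v = 0; v < n_vocab; ++v)
            for (int j = 0; j < d_model; ++j)
                out[v] += weight[(size_t) v * d_model + j] * h[j];
        return out;
    }
};
```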

coder543 commented 2 weeks ago

Now Google has released a 9B version of RecurrentGemma (arxiv link), which seems to score similarly to Gemma-7B while supposedly being far more efficient:

[Figure: max_throughput benchmark comparison] (source)

Any chance llama.cpp can support RecurrentGemma, @ggerganov? I wish I had the skill to implement it myself, but I have no familiarity with llama.cpp's inner workings; I'm just a user of the software.
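For what it's worth, the efficiency argument is mostly about memory: a full-attention transformer's KV cache grows linearly with context length, while a recurrent layer keeps a fixed-size state. A back-of-the-envelope comparison, with every dimension below chosen purely for illustration (not the published RecurrentGemma config):

```cpp
#include <cstdio>

int main() {
    // All dimensions are illustrative, not a real model configuration.
    const long n_layer   = 26;
    const long head_dim  = 256, n_kv_head = 1; // per attention layer
    const long d_state   = 2560;               // per recurrent layer (fixed)
    const long bytes     = 2;                  // fp16

    const long lengths[] = {2048, 8192, 32768};
    for (long T : lengths) {
        // Full-attention transformer: K and V stored per token, per layer.
        const long kv_cache  = 2 * n_layer * n_kv_head * head_dim * T * bytes;
        // Purely recurrent stack: one fixed-size state per layer.
        const long rnn_state = n_layer * d_state * bytes;
        std::printf("T = %6ld   KV cache: %8ld KiB   recurrent state: %5ld KiB\n",
                    T, kv_cache / 1024, rnn_state / 1024);
    }
    return 0;
}
```

RecurrentGemma sits between the two extremes: its local-attention layers still keep a KV cache, but one bounded by the attention window, so total memory plateaus instead of growing with the prompt.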

ggerganov commented 2 weeks ago

Will be added, though we probably have to merge Jamba (https://github.com/ggerganov/llama.cpp/pull/7531) first and then see how to adapt llama_cache to support the new Griffin layers.
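For anyone following along, one plausible shape for that adaptation is sketched below. This is purely hypothetical, not the actual llama_cache interface from the Jamba PR; it just illustrates why a hybrid model needs two kinds of per-layer state:

```cpp
#include <cstdint>
#include <variant>
#include <vector>

// Hypothetical sketch of a hybrid per-layer cache, NOT the real llama_cache
// from PR #7531. Griffin interleaves local-attention layers (which need a
// window-bounded KV cache) with recurrent layers (which need a fixed-size
// state per sequence).
struct kv_layer_cache {
    std::vector<float> k, v;  // window-bounded: n_kv_head * head_dim * window
    int64_t head = 0;         // ring-buffer write position
};

struct recurrent_layer_cache {
    std::vector<float> state; // fixed size d_state, overwritten each step
    std::vector<float> conv;  // short conv history used by Griffin blocks
};

using layer_cache = std::variant<kv_layer_cache, recurrent_layer_cache>;

struct hybrid_cache {
    std::vector<layer_cache> layers; // one entry per model layer
    // Unlike a pure KV cache, total size is bounded by the attention
    // window, so long prompts do not grow memory without limit.
};
```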

DuckyBlender commented 1 week ago

Great news! People often overlook more efficient architectures; supporting this will speed so many things up!

Meshwa428 commented 3 days ago

Is RecurrentGemma going to come to Ollama or not?

It uses Google's custom Griffin architecture, right?