TechxGenus opened 2 months ago
This is the PR that was merged in gemma.cpp: https://github.com/google/gemma.cpp/pull/136/files
This issue was marked as stale, but shouldn’t supporting more efficient architectures be a priority?
Will this or Griffin be in the upcoming Gemma 2 model(s)? I say "this or Griffin" because the paper mentions a slight difference between RecurrentGemma and Griffin, FWIW:
> We make only a single modification to the Griffin architecture (De et al., 2024), which is to multiply the input embeddings by a constant equal to the square root of model width. The input and output embeddings are tied, but this factor is not applied to the output.
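For concreteness, here is a minimal sketch of that one modification in NumPy. This is illustrative only, not the actual RecurrentGemma code; the class and method names are my own assumptions:

```python
import math
import numpy as np

class TiedEmbedding:
    """Sketch of the change quoted above: input embeddings are scaled by
    sqrt(model width), while the tied output projection reuses the same
    table *without* that factor. (Hypothetical names, not Google's code.)"""

    def __init__(self, vocab_size: int, width: int, seed: int = 0):
        self.width = width
        rng = np.random.default_rng(seed)
        # One shared weight table for both input and output (tied embeddings).
        self.table = rng.standard_normal((vocab_size, width))

    def encode(self, token_ids):
        # Input side: multiply by sqrt(width), per the paper quote.
        return self.table[token_ids] * math.sqrt(self.width)

    def decode(self, hidden):
        # Output side: tied weights, but no sqrt(width) factor applied.
        return hidden @ self.table.T
```

The asymmetry is the whole point: the same table is read twice, but only the input lookup gets the sqrt(width) multiplier.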
Seems like Griffin or a variant could be the "brand new architecture designed for breakthrough performance and efficiency" in today's Gemma 2 announcement, no?
Now Google has released a 9B version of RecurrentGemma (arxiv link), which seems to score similarly to Gemma-7b while supposedly being far more efficient (source).
Any chance llama.cpp can support RecurrentGemma, @ggerganov? I wish I had the skill to implement it myself, but I have no familiarity with llama.cpp's inner workings; I'm just a user of the software.
Will be added, though we probably have to merge Jamba (https://github.com/ggerganov/llama.cpp/pull/7531) first and then see how to adapt `llama_cache` to support the new Griffin layers
Great news. People often forget about more efficient architectures; supporting this will speed so many things up!
Is RecurrentGemma going to come to Ollama or not? It uses Google's custom Griffin architecture, right?
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

Google's newly released model, a hybrid architecture combining attention with a recurrent hidden state: https://huggingface.co/google/recurrentgemma-2b
Motivation
Please provide a detailed written description of reasons why this feature is necessary and how it is useful to llama.cpp users.

A good and open LLM with a novel architecture.
Possible Implementation
If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.
Unlike Jamba (#6372), this model is very small and can be used by most computers for inference. Hybrid architectures are likely to be the trend of the future. I hope that llama.cpp can support it and other hybrid architectures (if possible).
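To illustrate where the efficiency claim comes from: a recurrent block carries a fixed-size state per sequence, instead of a KV cache that grows with context length. The toy gated linear recurrence below is only a sketch of that idea, not the actual RG-LRU update used in Griffin:

```python
import numpy as np

def recurrent_scan(xs, a=0.9):
    """Toy gated linear recurrence h_t = a*h_{t-1} + (1-a)*x_t.
    The state `h` stays the same size no matter how long the sequence
    grows, which is the memory advantage over attention's KV cache.
    (Illustrative sketch only; Griffin's real RG-LRU gate is learned.)"""
    h = np.zeros_like(xs[0])  # fixed-size state, independent of seq length
    outs = []
    for x in xs:
        h = a * h + (1.0 - a) * x  # constant-memory state update per token
        outs.append(h.copy())
    return np.stack(outs)
```

With attention, generating token t requires keys/values for all t-1 previous tokens; here, only `h` is needed, so inference memory stays flat as the context grows.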