ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: Support Codestral Mamba #8519

Open VelocityRa opened 1 month ago

VelocityRa commented 1 month ago

Feature Description

New 7B coding model just released by Mistral.

Motivation

Seems to perform very well, especially for a 7B model:

(image: benchmark results from the Mistral announcement)

Possible Implementation

An extension to https://github.com/ggerganov/llama.cpp/issues/7727?

HanClinto commented 1 month ago

I love the shout-out in the linked blog post!

You can deploy Codestral Mamba using the mistral-inference SDK, which relies on the reference implementations from Mamba’s GitHub repository. The model can also be deployed through TensorRT-LLM. For local inference, keep an eye out for support in llama.cpp. You may download the raw weights from HuggingFace.

That's a really nice nod -- love to see it!

theo77186 commented 1 month ago

#7727 should cover this model, but with untied embeddings, unlike the other Mamba-2 models.
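
(For context, "untied" here means the output projection is a separate tensor instead of reusing the token-embedding matrix. A minimal sketch of the difference, with illustrative names rather than llama.cpp's actual tensors:)

```cpp
// Sketch of tied vs. untied output embeddings (illustrative shapes/names only).
#include <vector>

struct ToyModel {
    std::vector<float> tok_embd; // [n_vocab * n_embd] token-embedding matrix
    std::vector<float> output;   // [n_vocab * n_embd] separate output matrix; empty => tied
    int n_vocab = 0;
    int n_embd  = 0;
};

// Project the final hidden state h [n_embd] to vocabulary logits [n_vocab].
std::vector<float> lm_head(const ToyModel & m, const std::vector<float> & h) {
    // Tied models reuse tok_embd as the output projection; untied models
    // ship a separate output tensor.
    const std::vector<float> & W = m.output.empty() ? m.tok_embd : m.output;
    std::vector<float> logits(m.n_vocab, 0.0f);
    for (int v = 0; v < m.n_vocab; ++v) {
        for (int i = 0; i < m.n_embd; ++i) {
            logits[v] += W[v * m.n_embd + i] * h[i];
        }
    }
    return logits;
}
```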

timlacroix commented 1 month ago

FYI, there is an "ngroups" param that changes how the layer norm is done: https://github.com/state-spaces/mamba/blob/c0a00bd1808881831ddf43206c69362d4df90cf7/mamba_ssm/modules/mamba2.py#L47

We use ngroups=8. If you forget it or try with ngroups = 1, you'll have a bad time.
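
For anyone implementing this, here is a minimal sketch of what group-wise RMS normalization means in practice; it is illustrative only (the exact gated-norm details and weights are in the linked mamba2.py), and assumes ngroups evenly divides the channel dimension:

```cpp
// Group-wise RMS normalization: each group of channels gets its own RMS statistic.
#include <cmath>
#include <vector>

void grouped_rms_norm(std::vector<float> & x, int ngroups, float eps = 1e-5f) {
    const int group_size = (int) x.size() / ngroups;
    for (int g = 0; g < ngroups; ++g) {
        float sum_sq = 0.0f;
        for (int i = 0; i < group_size; ++i) {
            const float v = x[g * group_size + i];
            sum_sq += v * v;
        }
        const float scale = 1.0f / std::sqrt(sum_sq / group_size + eps);
        for (int i = 0; i < group_size; ++i) {
            x[g * group_size + i] *= scale;
        }
    }
}

// With ngroups = 1 the whole vector shares one RMS statistic; with ngroups = 8
// each eighth of the channels is normalized on its own, so the two settings
// produce different outputs for the same weights.
```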

Good luck!

ggerganov commented 1 month ago

After we merge https://github.com/ggerganov/llama.cpp/pull/8526, we should try to add full support for this model. cc @compilade

0wwafa commented 1 month ago

I'd love this.

txhno commented 1 month ago

thanks!

fredconex commented 1 month ago

Hey guys, any progress or an ETA on this?

rmusser01 commented 2 weeks ago

For anyone else following along: it seems this is waiting on https://github.com/ggerganov/llama.cpp/pull/8526, which is waiting on https://github.com/ggerganov/llama.cpp/pull/8980, which in turn is waiting on review(?).

compilade commented 2 weeks ago

A progress report: I have a local branch (not yet public) on top of #8526 in which I've started implementing the graph for Mamba-2. The conv step is very similar to Mamba-1. I've started implementing the SSM step and will continue over the next few days. It's not in a usable state yet.

I'm starting by implementing the fully recurrent mode of Mamba-2, which is very similar to Mamba-1 and is described in Section 3.4.1 of the Mamba-2 paper.

But I'm still evaluating how the block decomposition would fit with how src/llama.cpp manages batches, and whether the chunk size should be dynamic. To fully benefit from Section 6, the chunks should be smaller than the batch size, but not too small, since at that point directly doing the recurrence costs the same. Because the ggml compute graph nodes should keep the same structure between batches, and because the block decomposition will likely have too much overhead for small batches, it's easier to simply go with the linear recurrence (something like ggml_ssm_scan) at first.
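
For reference, a minimal sketch of the per-token linear recurrence (the fully recurrent mode) for a single Mamba-2 head; shapes and names are illustrative assumptions, not the actual ggml_ssm_scan implementation:

```cpp
// One recurrent SSM step for a single Mamba-2 head.
// state: [d_state * head_dim], x_t: [head_dim], B_t/C_t: [d_state].
// In Mamba-2, A is a single scalar per head (unlike Mamba-1's per-channel A).
#include <cmath>
#include <vector>

void ssm_step(std::vector<float> & state, std::vector<float> & y_t,
              const std::vector<float> & x_t,
              const std::vector<float> & B_t, const std::vector<float> & C_t,
              float A, float dt, float D,
              int d_state, int head_dim) {
    const float dA = std::exp(dt * A);              // scalar decay for the whole head
    for (int p = 0; p < head_dim; ++p) {
        float acc = 0.0f;
        for (int s = 0; s < d_state; ++s) {
            float & h = state[s * head_dim + p];
            h = dA * h + dt * B_t[s] * x_t[p];      // h_t = exp(dt*A) h_{t-1} + dt * B_t x_t
            acc += C_t[s] * h;                      // y_t = C_t . h_t
        }
        y_t[p] = acc + D * x_t[p];                  // skip connection through D
    }
}
```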

For the ETA, I'll try to get it working before the end of August, but no promises.

(and BTW @rmusser01, #8980 is waiting on #8526, not the other way around, at least I think?)

compilade commented 2 weeks ago

Okay, the fully recurrent mode works for Mamba-2! (For the curious, see this branch: https://github.com/compilade/llama.cpp/tree/compilade/mamba2.) I'll open a PR in the next few days; I still need to clean up a few things.

Note that Mamba-Codestral-7B-v0.1 cannot be converted as-is; either use https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1/discussions/9, or rename consolidated.safetensors to model.safetensors, tokenizer.model.v3 to tokenizer.model, and params.json to config.json. Then add the line "architectures": ["Mamba2ForCausalLM"], to config.json if it's missing.

The state in Mamba-2 is bigger than I thought: Mamba-Codestral-7B-v0.1 takes 263.5 MiB (in F32) per sequence (e.g. with -np 1), compared to 38 MiB for Falcon-Mamba-7B (which is based on Mamba-1). But that stays constant regardless of context size.
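
For a rough sense of where that number comes from, here is a back-of-the-envelope check assuming the published Mamba-Codestral-7B-v0.1 hyperparameters (these values are my assumption, not stated in this thread): d_model = 4096, expand = 2, d_state = 128, d_conv = 4, n_groups = 8, 64 layers, states kept in F32.

```cpp
// Back-of-the-envelope recurrent-state size per sequence (assumed config).
#include <cstdio>

int main() {
    const int n_layer  = 64;
    const int d_inner  = 2 * 4096;   // expand * d_model
    const int d_state  = 128;
    const int d_conv   = 4;
    const int n_groups = 8;

    // SSM state: d_inner * d_state floats per layer
    const double ssm_bytes  = (double) n_layer * d_inner * d_state * 4;
    // conv state: (d_conv - 1) rolling columns over the concatenated x, B, C channels
    const double conv_bytes = (double) n_layer * (d_conv - 1) *
                              (d_inner + 2 * n_groups * d_state) * 4;

    printf("SSM  state: %.1f MiB\n", ssm_bytes  / (1024.0 * 1024.0));                // 256.0 MiB
    printf("conv state: %.1f MiB\n", conv_bytes / (1024.0 * 1024.0));                //   7.5 MiB
    printf("total     : %.1f MiB\n", (ssm_bytes + conv_bytes) / (1024.0 * 1024.0));  // 263.5 MiB
    return 0;
}
```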

A big downside right now with recurrent models in llama.cpp is the lack of state rollback (which is implemented through state checkpoints in #7531, but needs to be re-adapted to #8526), so the prompt will be reprocessed a lot if using llama-server. I think using llama-cli in conversation mode does not have this problem, however (or maybe only the bare interactive mode with --in-prefix and --in-suffix, not sure).
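
As a rough illustration of the checkpoint idea (not the actual #7531 code): keep occasional copies of the recurrent state keyed by how many tokens produced them, and resume from the longest checkpoint whose token count does not exceed the shared prefix of the new request.

```cpp
// Minimal sketch of recurrent-state checkpoints (illustrative only).
#include <map>
#include <vector>

struct Checkpoints {
    std::map<int, std::vector<float>> by_n_tokens;  // n_tokens -> state copy

    void save(int n_tokens, const std::vector<float> & state) {
        by_n_tokens[n_tokens] = state;  // in practice the number of copies would be capped
    }

    // Return the largest checkpoint with n_tokens <= n_prefix, or -1 if none exists.
    int best_for_prefix(int n_prefix, std::vector<float> & state_out) const {
        auto it = by_n_tokens.upper_bound(n_prefix);
        if (it == by_n_tokens.begin()) return -1;
        --it;
        state_out = it->second;
        return it->first;  // only tokens past this point need reprocessing
    }
};
```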

The implementation is CPU-only, but it uses SIMD for the SSM scan, so even though the state is bigger than for Mamba-1 models, in my tests the speed of Mamba-2-130M is similar to or better than Mamba-130M (though still not that fast compared to transformer-based models with an empty context).

The speed of Mamba-2 models seems comparable to that of Transformer-based models once the latter have 2k to 4k tokens in their context.

Just making sure expectations are not too far from reality.