| Rust Documentation | Python Documentation | Discord |
Mistral.rs is a fast LLM inference platform supporting inference on a variety of devices, quantization, and easy-to-use applications with an OpenAI-compatible HTTP server and Python bindings.
Please submit requests for new models here.
1) Install
2) Get models
3) Deploy with our easy-to-use APIs
After following the installation instructions, try the quick examples below:
AnyMoE: Build a memory-efficient MoE model from anything, in seconds
./mistralrs_server -i toml -f toml-selectors/anymoe_lora.toml
Run the Gemma 2 model
./mistralrs_server -i plain -m google/gemma-2-9b-it -a gemma2
Run the Phi 3 model with a 128K context window
./mistralrs_server -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
Run the Phi 3 vision model: documentation and guide here
./mistralrs_server --port 1234 vision-plain -m microsoft/Phi-3-vision-128k-instruct -a phi3v
Other models: see a support matrix and how to run them
Mistral.rs supports several model categories:

Fast:
- Accelerator support: `mkl` and `accelerate` support, and an optimized backend.

Easy:
- Run `.safetensors` models directly from Hugging Face Hub by quantizing them after loading instead of creating a GGUF file.

Powerful:
This is a demo of interactive mode with streaming, running Phi 3 Mini 128K with in-situ quantization (ISQ) to Q4K.
https://github.com/EricLBuehler/mistral.rs/assets/65165915/09d9a30f-1e22-4b9a-9006-4ec6ebc6473c
Note: See supported models for more information
| Model | Supports quantization | Supports adapters | Supports device mapping | Supported by AnyMoE |
|---|---|---|---|---|
| Mistral v0.1/v0.2/v0.3 | ✅ | ✅ | ✅ | ✅ |
| Gemma | ✅ | ✅ | ✅ | ✅ |
| Llama 2/3 | ✅ | ✅ | ✅ | ✅ |
| Mixtral | ✅ | ✅ | ✅ | |
| Phi 2 | ✅ | ✅ | ✅ | ✅ |
| Phi 3 | ✅ | ✅ | ✅ | ✅ |
| Qwen 2 | ✅ | ✅ | ✅ | |
| Phi 3 Vision | ✅ | ✅ | ✅ | |
| Idefics 2 | ✅ | ✅ | ✅ | |
| Gemma 2 | ✅ | ✅ | ✅ | ✅ |
| Starcoder 2 | ✅ | ✅ | ✅ | ✅ |
| LLaVa Next | ✅ | ✅ | ✅ | |
| LLaVa | ✅ | ✅ | ✅ | |
Rust multithreaded/async API for easy integration into any application.
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git" }
Python API for mistral.rs.
OpenAI API compatible API server
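Because the server is OpenAI-compatible, any OpenAI-style client can talk to it. The sketch below builds a chat-completion request for the standard `/v1/chat/completions` route using only the Python standard library; the port (1234), model name, and the assumption that a server is already running are illustrative.

```python
import json
from urllib.request import Request, urlopen

def build_chat_request(model: str, prompt: str, port: int = 1234) -> Request:
    """Build an OpenAI-style chat completion request for a local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("phi3", "Say hello!")
# Requires a running server, e.g. started with one of the commands above:
# response = json.loads(urlopen(req).read())
# print(response["choices"][0]["message"]["content"])
```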
- `cuda` feature: `--features cuda`
- `flash-attn` feature, only applicable to non-quantized models: `--features flash-attn`
- `cudnn` feature: `--features cudnn`
- `metal` feature: `--features metal`
- `mkl` feature: `--features mkl`
- `accelerate` feature: `--features accelerate`
Enabling features is done by passing `--features ...` to the build system. When using `cargo run` or `maturin develop`, pass the `--features` flag before the `--` separating build flags from runtime flags.

To enable a single feature like `metal`: `cargo build --release --features metal`.
To enable multiple features: `cargo build --release --features "cuda flash-attn cudnn"`.
| Device | Mistral.rs Completion T/s | Llama.cpp Completion T/s | Model | Quant |
|---|---|---|---|---|
| A10 GPU, CUDA | 89 | 83 | mistral-7b | 4_K_M |
| Intel Xeon 8358 CPU, AVX | 11 | 23 | mistral-7b | 4_K_M |
| Raspberry Pi 5 (8GB), Neon | 2 | 3 | mistral-7b | 2_K |
| A100 GPU, CUDA | 119 | 102 | mistral-7b | 4_K_M |
| A6000 GPU, CUDA | 115 | 102 | mistral-7b | 4_K_M |
Please submit more benchmarks by raising an issue!
Note: You can use our Docker containers here. Learn more about running Docker containers: https://docs.docker.com/engine/reference/run/
Note: You can use pre-built `mistralrs-server` binaries here.
1) Install required packages:
- OpenSSL (example on Ubuntu: `sudo apt install libssl-dev`)
- `pkg-config` (example on Ubuntu: `sudo apt install pkg-config`)

2) Install Rust: https://rustup.rs/
*Example on Ubuntu:*
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
3) Optional: Set the HF token correctly (skip if already set, if your model is not gated, or if you want to use the `token_source` parameters in Python or on the command line). Log in with `huggingface-cli` as documented here:

huggingface-cli login
4) Download the code
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs
5) Build or install
cargo build --release
cargo build --release --features cuda
cargo build --release --features "cuda flash-attn"
cargo build --release --features metal
cargo build --release --features accelerate
cargo build --release --features mkl
Install with `cargo install` for easy command-line usage:
Pass the same values to `--features` as you would for `cargo build`
```bash
cargo install --path mistralrs-server --features cuda
```
6) The build process will output a binary `mistralrs-server` at `./target/release/mistralrs-server`, which may be copied into the working directory with the following command:
Example on Ubuntu:
cp ./target/release/mistralrs-server ./mistralrs_server
7) Use our APIs and integrations
[APIs and integrations list](#apis-and-integrations)
There are two ways to run a model with mistral.rs:
Mistral.rs can automatically download models from HF Hub. To access gated models, you should provide a token source. They may be one of:
- `literal:<value>`: Load from a specified literal
- `env:<value>`: Load from a specified environment variable
- `path:<value>`: Load from a specified file
- `cache` (default): Load from the HF token at `~/.cache/huggingface/token` (or equivalent)
- `none`: Use no HF token

This is passed in the following ways:

./mistralrs_server --token-source none -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
Here is an example of setting the token source. If the token cannot be loaded, no token will be used (i.e., effectively using `none`).
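To make the token-source formats concrete, here is a hypothetical Python helper that resolves a token string the same way the list above describes. It is an illustration of the syntax only, not the actual mistral.rs implementation.

```python
import os
from pathlib import Path
from typing import Optional

def resolve_hf_token(source: str) -> Optional[str]:
    """Illustrative resolver for the token-source syntax above
    (not the actual mistral.rs implementation)."""
    if source == "none":
        return None  # use no HF token
    if source == "cache":
        token_file = Path.home() / ".cache" / "huggingface" / "token"
        return token_file.read_text().strip() if token_file.exists() else None
    kind, _, value = source.partition(":")
    if kind == "literal":
        return value  # the token itself
    if kind == "env":
        return os.environ.get(value)  # read from an environment variable
    if kind == "path":
        return Path(value).read_text().strip()  # read from a file
    raise ValueError(f"unknown token source: {source!r}")
```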
You can also instruct mistral.rs to load models fully locally by modifying the `*_model_id` arguments or options:
./mistralrs_server --port 1234 plain -m . -a mistral
Throughout mistral.rs, any model ID argument or option may be a local path and should contain the following files for each model ID option:
- `--model-id` (server) or `model_id` (python/rust) or `--tok-model-id` (server) or `tok_model_id` (python/rust):
  - `config.json`
  - `tokenizer_config.json`
  - `tokenizer.json` (if not specified separately)
  - `.safetensors` files
  - `preprocessor_config.json` (required for vision models)
  - `processor_config.json` (optional for vision models)
- `--quantized-model-id` (server) or `quantized_model_id` (python/rust):
  - The specified `.gguf` or `.ggml` file
- `--x-lora-model-id` (server) or `xlora_model_id` (python/rust):
  - `xlora_classifier.safetensors`
  - `xlora_config.json`
  - Adapter `.safetensors` and `adapter_config.json` files in their respective directories
- `--adapters-model-id` (server) or `adapters_model_id` (python/rust):
  - Adapter `.safetensors` and `adapter_config.json` files in their respective directories

To run GGUF models fully locally, the only mandatory arguments are the quantized model ID and the quantized filename.
The chat template can be automatically detected and loaded from the GGUF file if no other chat template source, including the tokenizer model ID, is specified. In that case, you do not need to specify the tokenizer model ID argument; instead, pass a path to the chat template JSON file (examples here; you will need to create your own by specifying the chat template and `bos`/`eos` tokens) as well as a local model ID. For example:
./mistralrs-server --chat-template <chat_template> gguf -m . -f Phi-3-mini-128k-instruct-q4_K_M.gguf
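A chat template JSON file of the kind referenced above generally specifies the Jinja-style template alongside the `bos`/`eos` tokens. The sketch below is a hypothetical minimal example; the token values and the template itself depend on your model, so consult the bundled examples rather than copying this verbatim.

```json
{
  "chat_template": "{% for message in messages %}{{ message['role'] }}: {{ message['content'] }}\n{% endfor %}",
  "bos_token": "<s>",
  "eos_token": "</s>"
}
```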
If you do not specify a chat template, then the `--tok-model-id`/`-t` tokenizer model ID argument is expected, where the `tokenizer_config.json` file should be provided. If that model ID contains a `tokenizer.json`, then it will be used over the GGUF tokenizer.
The following tokenizer model types are currently supported. If you would like one to be added, please raise an issue. Otherwise, please consider using the method demonstrated in examples below, where the tokenizer is sourced from Hugging Face.
Supported GGUF tokenizer types:
- `llama` (sentencepiece)
- `gpt2` (BPE)

Mistral.rs uses subcommands to control the model type. They are generally of the format `<XLORA/LORA>-<QUANTIZATION>`. Please run `./mistralrs_server --help` to see the subcommands.
Additionally, for models without quantization, the model architecture should be provided as the `--arch` or `-a` argument, in contrast to GGUF models, which encode the architecture in the file.
Note: for plain models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16` or `auto` to choose based on the device. This is specified in the `--dtype`/`-d` parameter after the model architecture (`plain`).
- `mistral`
- `gemma`
- `mixtral`
- `llama`
- `phi2`
- `phi3`
- `qwen2`
- `gemma2`
- `starcoder2`
Note: for vision models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16` or `auto` to choose based on the device. This is specified in the `--dtype`/`-d` parameter after the model architecture (`vision-plain`).
- `phi3v`
- `idefics2`
- `llava_next`
- `llava`
Interactive mode:
You can launch interactive mode, a simple chat application running in the terminal, by passing `-i`:
./mistralrs_server -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
Interactive mode for vision models:
You can launch interactive mode for vision models, a simple chat application running in the terminal, by passing `--vi`:
./mistralrs_server --vi vision-plain -m microsoft/Phi-3-vision-128k-instruct -a phi3v
To start an X-LoRA server with the model exactly as presented in the paper:
./mistralrs_server --port 1234 x-lora-plain -o orderings/xlora-paper-ordering.json -x lamm-mit/x-lora
To start a LoRA server with adapters from the X-LoRA paper (you should modify the ordering file to use only one adapter, as the adapter static scalings are all 1 and so the signal will become distorted):
./mistralrs_server --port 1234 lora-gguf -o orderings/xlora-paper-ordering.json -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q8_0.gguf -a lamm-mit/x-lora
Normally with a LoRA model you would use a custom ordering file. However, for this example we use the ordering from the X-LoRA paper because we are using the adapters from the X-LoRA paper.
To start a server running Mistral from GGUF:
./mistralrs_server --port 1234 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
To start a server running Llama from GGML:
./mistralrs_server --port 1234 ggml -t meta-llama/Llama-2-13b-chat-hf -m TheBloke/Llama-2-13B-chat-GGML -f llama-2-13b-chat.ggmlv3.q4_K_M.bin
To start a server running Mistral from safetensors:
./mistralrs_server --port 1234 plain -m mistralai/Mistral-7B-Instruct-v0.1 -a mistral
We provide a method to select models with a `.toml` file. The keys are the same as the command line, with `no_kv_cache` and `tokenizer_json` being "global" keys.
Example:
./mistralrs_server --port 1234 toml -f toml-selectors/gguf.toml
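For orientation, a selector file might look like the hypothetical sketch below. The section and key names here are illustrative; consult the files in `toml-selectors/` in the repository for the actual schema.

```toml
# Hypothetical sketch only; see toml-selectors/ for real examples.
no_kv_cache = false        # a "global" key, per the text above

[model]                    # section and key names are illustrative
model_id = "microsoft/Phi-3-mini-128k-instruct"
arch = "phi3"
```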
Quantization support

| Model | GGUF | GGML | ISQ |
|---|---|---|---|
| Mistral | ✅ | | ✅ |
| Gemma | | | ✅ |
| Llama | ✅ | ✅ | ✅ |
| Mixtral | ✅ | | ✅ |
| Phi 2 | ✅ | | ✅ |
| Phi 3 | ✅ | | ✅ |
| Qwen 2 | | | ✅ |
| Phi 3 Vision | | | ✅ |
| Idefics 2 | | | ✅ |
| Gemma 2 | | | ✅ |
| Starcoder 2 | | | ✅ |
| LLaVa Next | | | ✅ |
| LLaVa | | | ✅ |
Device mapping support

| Model category | Supported |
|---|---|
| Plain | ✅ |
| GGUF | ✅ |
| GGML | |
| Vision Plain | ✅ |
X-LoRA and LoRA support

| Model | X-LoRA | X-LoRA+GGUF | X-LoRA+GGML |
|---|---|---|---|
| Mistral | ✅ | ✅ | |
| Gemma | ✅ | | |
| Llama | ✅ | ✅ | ✅ |
| Mixtral | ✅ | ✅ | |
| Phi 2 | ✅ | | |
| Phi 3 | ✅ | ✅ | |
| Qwen 2 | | | |
| Phi 3 Vision | | | |
| Idefics 2 | | | |
| Gemma 2 | ✅ | | |
| Starcoder 2 | ✅ | | |
| LLaVa Next | | | |
| LLaVa | | | |
AnyMoE support

| Model | AnyMoE |
|---|---|
| Mistral 7B | ✅ |
| Gemma | ✅ |
| Llama | ✅ |
| Mixtral | ✅ |
| Phi 2 | ✅ |
| Phi 3 | ✅ |
| Qwen 2 | ✅ |
| Phi 3 Vision | |
| Idefics 2 | |
| Gemma 2 | ✅ |
| Starcoder 2 | ✅ |
| LLaVa Next | ✅ |
| LLaVa | ✅ |
To use a derivative model, select the model architecture using the correct subcommand. To see what can be passed for the architecture, pass `--help` after the subcommand. For example, when using a different model than the default, specify the following for the following types of models:
See this section to determine if it is necessary to prepare an X-LoRA/LoRA ordering file; it is always necessary if the target modules or architecture changed, or if the adapter order changed.
It is also important to check the chat template style of the model. If the HF hub repo has a `tokenizer_config.json` file, it is not necessary to specify the chat template. Otherwise, templates can be found in `chat_templates` and should be passed before the subcommand. If the model is not instruction tuned, no chat template will be found and the APIs will only accept a prompt, not messages.
For example, when using a Zephyr model:
./mistralrs_server --port 1234 --log output.txt gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf
An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting the `x-lora-*` architecture, and LoRA support by selecting the `lora-*` architecture. Please find docs for adapter models here.
Mistral.rs will attempt to automatically load a chat template and tokenizer. This enables high flexibility across models and ensures accurate and flexible chat templating. However, this behavior can be customized. Please find detailed documentation here.
Thank you for contributing! If you have any problems or want to contribute something, please raise an issue or pull request. If you want to add a new model, please contact us via an issue and we can coordinate how to do this.
Setting `MISTRALRS_DEBUG=1` causes the following things:
- If loading a GGUF or GGML model, this will output a file containing the names and shapes of each tensor: `mistralrs_gguf_tensors.txt` or `mistralrs_ggml_tensors.txt`.

Consider setting the `NVCC_CCBIN` environment variable during build.

`recompile with -fPIE`:
- Some systems require compiling with `-fPIE`. Set the `CUDA_NVCC_FLAGS` environment variable to `-fPIE` during build: `CUDA_NVCC_FLAGS=-fPIE`

`CUDA_ERROR_NOT_FOUND` or symbol not found when using a normal or vision model:
- For non-quantized models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16` or `auto` to choose based on the device.

This project would not be possible without the excellent work at candle. Additionally, thank you to all contributors! Contributing can range from raising an issue or suggesting a feature to adding some new functionality.