EricLBuehler / mistral.rs

Blazingly fast LLM inference.

Question about some commands #430

Closed nomopo45 closed 1 month ago

nomopo45 commented 2 months ago

Hello,

Thanks for the project, it looks really nice! I'm new to this world and I'm struggling to do what I want.

I have a MacBook M1 Pro with 16 GB of RAM. I managed to install it and make it run, but some things I would like to do don't work.

First, I use this command:

./mistralrs_server --port 1234 gguf -t /Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-GGUF.tmpl -m /Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF -f Meta-Llama-3-8B-Instruct-Q8_0.gguf

From my understanding, -t refers to tokenizer_config.json, -m is the path of the model, and -f is the actual GGUF model file name?

Once the command runs, it says:

2024-06-14T03:32:40.527643Z  INFO mistralrs_server: Model loaded.
2024-06-14T03:32:40.528086Z  INFO mistralrs_server: Serving on http://0.0.0.0:1234.

So I tried a curl, but it takes really long to get a reply, like ~10 sec:

curl http://0.0.0.0:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "Meta-Llama-3-8B-Instruct-GGUF",
"messages": [
    {
        "role": "system",
        "content": "You are Mistral.rs, an AI assistant."
    },
    {
        "role": "user",
        "content": "Hello, World!"
    }
]
}'

{"id":"2","choices":[{"finish_reason":"stop","index":0,"message":{"content":"Hello! Nice to meet you! Is there something I can help you with today? I'm Mistral.rs, your friendly AI assistant.","role":"assistant"},"logprobs":null}],"created":1718336732,"model":"/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/","system_fingerprint":"local","object":"chat.completion","usage":{"completion_tokens":29,"prompt_tokens":29,"total_tokens":58,"avg_tok_per_sec":3.0154934,"avg_prompt_tok_per_sec":4.3071437,"avg_compl_tok_per_sec":2.3198144,"total_time_sec":19.234,"total_prompt_time_sec":6.733,"total_completion_time_sec":12.501}}%

Is it normal for it to be so slow?

Now I tried to have an interactive chat, but all the commands I used failed. Could someone guide me on how to get an interactive chat?

Could someone also explain what LoRA and X-LoRA are, what they do, and why we should use them?

I'm also using LM Studio, but I wanted to give this project a try since the tokens/s seem to be one of its strong points.

Anyway thanks in advance for any replies and help!

PS: here is what's inside the /Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/

Meta-Llama-3-8B-Instruct-GGUF  tree                                                           
.
├── Meta-Llama-3-8B-Instruct-Q8_0.gguf
├── tokenizer.json
└── tokenizer_config.json

1 directory, 3 files
EricLBuehler commented 2 months ago

Hi @nomopo45! Happy to answer any questions.

From my understanding, -t refers to tokenizer_config.json, -m is the path of the model, and -f is the actual GGUF model file name?

Yes.
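
For reference, here is a rough annotation of the flags in that command, based on this thread (the exact meaning of each option is best confirmed with --help):

# -t : where the tokenizer files (tokenizer_config.json / tokenizer.json) live: a local directory or a Hugging Face model ID
# -m : the model directory or Hugging Face model ID
# -f : the GGUF file name inside that directory
./mistralrs_server --port 1234 gguf \
  -t /Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF \
  -m /Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF \
  -f Meta-Llama-3-8B-Instruct-Q8_0.gguf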

So I tried a curl, but it takes really long to get a reply, like ~10 sec:

The first request will always be a bit slower as the tensors are loaded into memory. Some other systems opt to do this during startup, but we did not do that to ensure fast startup times. If this problem continues past the first request, ensure that you are building with the correct hardware accelerator, if you have one. Note that the performance will decrease as the prompt length goes on, but that is a known side effect of chat models, and we are actually working on a feature to optimize this (#350, #366).
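
If you want to check whether the slowness persists after warm-up, one option (assuming you have jq installed) is to send the same request a few times and compare the throughput fields the server returns in usage, e.g.:

curl -s http://0.0.0.0:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{"model": "Meta-Llama-3-8B-Instruct-GGUF", "messages": [{"role": "user", "content": "Hello, World!"}]}' \
  | jq '.usage | {avg_prompt_tok_per_sec, avg_compl_tok_per_sec, total_time_sec}'

The first run includes loading the tensors; later runs should show a higher avg_compl_tok_per_sec if everything is working as expected.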

Now I tried to have an interactive chat, but all the commands I used failed. Could someone guide me on how to get an interactive chat?

Absolutely! I'm not sure what your hardware is, but here are a few examples (I merged the build and run steps using cargo run). To use interactive mode, you replace --port xxxx with -i.
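
With a GGUF model, the only change from the server invocation is swapping --port for -i; an untested sketch with placeholder paths (substitute your own):

# HTTP server mode:
./mistralrs_server --port 1234 gguf -t <tokenizer dir or model ID> -m <model dir or model ID> -f <model>.gguf
# interactive chat mode (same arguments, -i instead of --port):
./mistralrs_server -i gguf -t <tokenizer dir or model ID> -m <model dir or model ID> -f <model>.gguf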

Could someone also explain what LoRA and X-LoRA are, what they do, and why we should use them?

LoRA is a popular technique for fine-tuning models, which involves efficiently training adapters. X-LoRA is a mixture of LoRA experts and uses dense gating to choose experts. See the X-LoRA paper. X-LoRA and LoRA are distinct methods, but they both improve the performance of the model (and at zero inference-time cost in the case of LoRA).
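
As a rough sketch of the underlying idea (the standard LoRA formulation, not anything mistral.rs-specific): a frozen weight matrix W receives a small trainable low-rank update,

W' = W + (alpha / r) * B * A,   with B: d x r, A: r x k, and r << min(d, k)

Only A and B are trained, which is what makes the fine-tuning cheap, and they can be merged back into W afterwards, which is why LoRA adds no inference-time cost. X-LoRA keeps several such adapters and uses the dense gating mentioned above to weight them.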

nomopo45 commented 2 months ago

Hello, thanks a lot for your answers!

So, I have a MacBook M1 Pro with 16 GB of RAM.

I gave interactive chat a try using your command; after seeing it and a few others, I feel it's not possible to have an interactive chat with a GGUF file. Am I right?

Then I downloaded this repo: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/tree/main

and ran the following command:

cargo run --release --features metal -- -i plain -m /Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5 -a llama
    Finished `release` profile [optimized] target(s) in 0.91s
     Running `target/release/mistralrs-server -i plain -m /Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5 -a llama`
2024-06-17T02:03:04.951259Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-06-17T02:03:04.951311Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-06-17T02:03:04.951359Z  INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-06-17T02:03:04.951403Z  INFO hf_hub: Token file not found "/Users/xxx/.cache/huggingface/token"
2024-06-17T02:03:04.951441Z  INFO mistralrs_core::utils::tokens: Could not load token at "/Users/xxx/.cache/huggingface/token", using no HF token.
2024-06-17T02:03:04.951618Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5`
2024-06-17T02:03:05.333603Z  INFO mistralrs_core::pipeline::normal: Loading `"tokenizer.json"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/tokenizer.json"`
2024-06-17T02:03:05.333636Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5`
2024-06-17T02:03:05.618104Z  INFO mistralrs_core::pipeline::normal: Loading `"config.json"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/config.json"`
2024-06-17T02:03:06.208592Z  INFO mistralrs_core::pipeline::paths: Loading `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00005-of-00007.safetensors"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00005-of-00007.safetensors"`
2024-06-17T02:03:06.488109Z  INFO mistralrs_core::pipeline::paths: Loading `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00001-of-00007.safetensors"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00001-of-00007.safetensors"`
2024-06-17T02:03:06.754389Z  INFO mistralrs_core::pipeline::paths: Loading `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00004-of-00007.safetensors"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00004-of-00007.safetensors"`
2024-06-17T02:03:07.037357Z  INFO mistralrs_core::pipeline::paths: Loading `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00007-of-00007.safetensors"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00007-of-00007.safetensors"`
2024-06-17T02:03:07.329491Z  INFO mistralrs_core::pipeline::paths: Loading `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00002-of-00007.safetensors"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00002-of-00007.safetensors"`
2024-06-17T02:03:07.602342Z  INFO mistralrs_core::pipeline::paths: Loading `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00006-of-00007.safetensors"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00006-of-00007.safetensors"`
2024-06-17T02:03:07.884465Z  INFO mistralrs_core::pipeline::paths: Loading `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00003-of-00007.safetensors"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00003-of-00007.safetensors"`
2024-06-17T02:03:08.742408Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5`
2024-06-17T02:03:09.039340Z  INFO mistralrs_core::pipeline::normal: Loading `"tokenizer_config.json"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/tokenizer_config.json"`
2024-06-17T02:03:09.040618Z  INFO mistralrs_core::utils::normal: DType selected is F32.
2024-06-17T02:03:09.040724Z  INFO mistralrs_core::pipeline::normal: Loading model `/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5` on Metal(MetalDevice(DeviceId(1)))...
2024-06-17T02:03:09.041088Z  INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_size: 4096, intermediate_size: 14336, vocab_size: 128256, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 8, use_flash_attn: false, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 8192 }
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 47/47 [00:04<00:00, 4.86it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:05<00:00, 6.17it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53/53 [00:09<00:00, 7.90it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456/456 [00:12<00:00, 143.62it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 47/47 [00:12<00:00, 5.93it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53/53 [00:10<00:00, 4.46it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53/53 [00:09<00:00, 12.17it/s]
Error: cannot find tensor model.embed_tokens.weight

So as you can see, it ends with an error, and I'm not sure what this model.embed_tokens.weight is?

Now, about the curl: here is the command I used for the build:

cargo build --release --features metal

Then, here is the command I use to start the server:

./mistralrs_server --port 1234 gguf -t /Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/ -m /Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF -f Meta-Llama-3-8B-Instruct-Q8_0.gguf
2024-06-17T02:12:33.837370Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-06-17T02:12:33.837438Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-06-17T02:12:33.837475Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-06-17T02:12:33.837534Z  INFO hf_hub: Token file not found "/Users/xxx/.cache/huggingface/token"
2024-06-17T02:12:33.837588Z  INFO mistralrs_core::utils::tokens: Could not load token at "/Users/xxx/.cache/huggingface/token", using no HF token.
2024-06-17T02:12:33.837786Z  INFO mistralrs_core::pipeline::gguf: Loading `tokenizer_config.json` at `/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/` because no chat template file was specified.
2024-06-17T02:12:34.203812Z  INFO mistralrs_core::pipeline::gguf: Loading `"tokenizer_config.json"` locally at `"/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/tokenizer_config.json"`
2024-06-17T02:12:34.204364Z  INFO hf_hub: Token file not found "/Users/xxx/.cache/huggingface/token"
2024-06-17T02:12:34.204423Z  INFO mistralrs_core::utils::tokens: Could not load token at "/Users/xxx/.cache/huggingface/token", using no HF token.
2024-06-17T02:12:34.499030Z  INFO mistralrs_core::pipeline::paths: Loading `"Meta-Llama-3-8B-Instruct-Q8_0.gguf"` locally at `"/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q8_0.gguf"`
2024-06-17T02:12:35.451508Z  INFO mistralrs_core::pipeline::gguf: Loading `tokenizer.json` at `/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/`
2024-06-17T02:12:36.719412Z  INFO mistralrs_core::pipeline::gguf: Loading `"tokenizer.json"` locally at `"/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/tokenizer.json"`
2024-06-17T02:12:36.719565Z  INFO mistralrs_core::pipeline::gguf: Loading model `/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/` on Metal(MetalDevice(DeviceId(1)))...
2024-06-17T02:12:37.249332Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 7
general.name: Meta-Llama-3-8B-Instruct-imatrix
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 8192
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 500000
llama.vocab_size: 128256
2024-06-17T02:12:43.418126Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|end_of_text|>", "<|eot_id|>", unk_tok = `None`
2024-06-17T02:12:43.428386Z  INFO mistralrs_server: Model loaded.
2024-06-17T02:12:43.429149Z  INFO mistralrs_server: Serving on http://0.0.0.0:1234.

And then I run my curl command; I ran the same command 3 times to show you the time it takes:

{"id":"0","choices":[{"finish_reason":"stop","index":0,"message":{"content":"Hello there! It's lovely to meet you! I'm Mistral.rs, a language model AI assistant, here to help and assist you to the best of my abilities. What's on your mind? Do you have any questions, topics you'd like to discuss, or perhaps some tasks you'd like me to help you with? I'm all ears (or rather, all text)!","role":"assistant"},"logprobs":null}],"created":1718336660,"model":"/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/","system_fingerprint":"local","object":"chat.completion","usage":{"completion_tokens":79,"prompt_tokens":29,"total_tokens":108,"avg_tok_per_sec":4.654972,"avg_prompt_tok_per_sec":2.3949127,"avg_compl_tok_per_sec":7.12225,"total_time_sec":23.201,"total_prompt_time_sec":12.109,"total_completion_time_sec":11.092}}

{"id":"1","choices":[{"finish_reason":"stop","index":0,"message":{"content":"Hello there! It's great to meet you! I'm Mistral.rs, your friendly AI assistant. How can I help you today? Do you have a question, or perhaps a topic you'd like to discuss?","role":"assistant"},"logprobs":null}],"created":1718336692,"model":"/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/","system_fingerprint":"local","object":"chat.completion","usage":{"completion_tokens":45,"prompt_tokens":29,"total_tokens":74,"avg_tok_per_sec":6.2153535,"avg_prompt_tok_per_sec":10.079945,"avg_compl_tok_per_sec":4.9839406,"total_time_sec":11.906,"total_prompt_time_sec":2.877,"total_completion_time_sec":9.029}}

{"id":"2","choices":[{"finish_reason":"stop","index":0,"message":{"content":"Hello! Nice to meet you! Is there something I can help you with today? I'm Mistral.rs, your friendly AI assistant.","role":"assistant"},"logprobs":null}],"created":1718336732,"model":"/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/","system_fingerprint":"local","object":"chat.completion","usage":{"completion_tokens":29,"prompt_tokens":29,"total_tokens":58,"avg_tok_per_sec":3.0154934,"avg_prompt_tok_per_sec":4.3071437,"avg_compl_tok_per_sec":2.3198144,"total_time_sec":19.234,"total_prompt_time_sec":6.733,"total_completion_time_sec":12.501}}

Completion seems quite slow, but maybe that's normal for a computer like mine; I'd just like to have your opinion on this.

Thanks a lot!

EricLBuehler commented 2 months ago

Hi @nomopo45!

I gave interactive chat a try using your command; after seeing it and a few others, I feel it's not possible to have an interactive chat with a GGUF file. Am I right?

No, you can:

./mistralrs_server -i gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
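
(-t and -m accept either a Hugging Face model ID, as above, or a local directory; with the files you already have, a roughly equivalent invocation would point both at your Meta-Llama-3-8B-Instruct-GGUF directory and keep -f Meta-Llama-3-8B-Instruct-Q8_0.gguf, exactly as in your server command but with -i instead of --port 1234.)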

Then I downloaded this repo: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/tree/main

The CPM model architecture (non-GGUF) is not compatible with the "plain" llama architecture; it would be necessary to add a mini-cpm architecture to the vision-plain category.

Completion seems quite slow, but maybe that's normal for a computer like mine; I'd just like to have your opinion on this.

I have seen similar results on similar hardware. This performance is not great, though; could you please try to build and rerun on the CPU?
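
A minimal sketch of the CPU comparison, assuming the default build (without the metal feature) targets the CPU:

# build without Metal
cargo build --release
# run the same GGUF server command with the freshly built binary
./target/release/mistralrs-server --port 1234 gguf -t <tokenizer dir> -m <model dir> -f Meta-Llama-3-8B-Instruct-Q8_0.gguf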

EricLBuehler commented 1 month ago

Closing as completed, please feel free to reopen!