Closed nomopo45 closed 1 month ago
Hi @nomopo45! Happy to answer any questions.
From my understanding, `-t` refers to `tokenizer_config.json`, `-m` would be the path of the model, and `-f` the actual GGUF model file name?
Yes.
So I tried a curl, but it takes super long to get a reply, like ~10 sec:
The first request will always be a bit slower as the tensors are loaded into memory. Some other systems opt to do this during startup, but we did not do that to ensure fast startup times. If this problem continues past the first request, ensure that you are building with the correct hardware accelerator, if you have one. Note that the performance will decrease as the prompt length goes on, but that is a known side effect of chat models, and we are actually working on a feature to optimize this (#350, #366).
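The growing per-token cost mentioned here comes from attention: each newly generated token attends over the entire prefix so far. A toy NumPy sketch (not mistral.rs code; all shapes are made up) of why the work per token grows with context length:

```python
import numpy as np

# Toy single-head attention (illustrative only) showing why per-token
# cost grows with context length: every new token attends over the
# entire prefix of keys/values accumulated so far.
d = 8
rng = np.random.default_rng(0)

def attend(q, K, V):
    # q: (d,), K and V: (t, d) -> softmax-weighted sum over the prefix
    w = np.exp(K @ q / np.sqrt(d))
    w /= w.sum()
    return w @ V

K = np.empty((0, d))
V = np.empty((0, d))
positions_attended = []
for step in range(5):
    k, v, q = rng.normal(size=(3, d))   # pretend projections for one token
    K = np.vstack([K, k])
    V = np.vstack([V, v])
    _ = attend(q, K, V)
    positions_attended.append(K.shape[0])

print(positions_attended)  # [1, 2, 3, 4, 5] -- work grows with the prompt
```

So even with the keys/values cached, step t still does O(t) work, which is why long prompts slow generation down.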
Now I tried to have an interactive chat, but all the commands I used failed. Could someone guide me on how to get an interactive chat?
Absolutely! I'm not sure what your hardware is, but here are a few examples (I merged the run with the build command using `cargo run`). To use interactive mode, you replace `--port xxxx` with `-i`.
On CUDA:
cargo run --release --features cuda -- -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
On CPU:
cargo run --release -- -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
On Metal:
cargo run --release --features metal -- -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
Could someone explain to me what LoRA and X-LoRA are, what they do, and why we should use them?
LoRA is a popular technique for fine-tuning models, which involves efficiently training small adapters. X-LoRA is a mixture of LoRA experts, and uses dense gating to choose experts. See the X-LoRA paper. X-LoRA and LoRA are distinct methods, but they both improve the performance of the model (and at zero temporal cost for LoRA).
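As a rough sketch of the LoRA idea (toy NumPy code with made-up shapes, not mistral.rs internals): the base weight W stays frozen, and only two small matrices A and B are trained, giving the effective weight W + (alpha / r) * B @ A. Since B starts at zero, the adapter is initially a no-op:

```python
import numpy as np

# LoRA sketch with hypothetical shapes: freeze W, train only A (r x d_in)
# and B (d_out x r); the effective weight is W + (alpha / r) * B @ A.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16

W = rng.normal(size=(d_out, d_in))      # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init

x = rng.normal(size=d_in)
base = W @ x
adapted = (W + (alpha / r) * B @ A) @ x

# Zero-initialized B makes the adapter a no-op at the start of training,
# and the trainable parameter count is tiny compared to W.
print(np.allclose(base, adapted))   # True
print((A.size + B.size) / W.size)   # 0.25
```

At inference time the product B @ A can be merged into W, which is why LoRA adds no runtime cost.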
Hello, thanks a lot for your answers!
So I have a MacBook M1 Pro with 16 GB.
I gave interactive chat a try using your command; after seeing your commands and some others, I feel it's not possible to have an interactive chat with a GGUF file, am I right?
Then I downloaded this repo: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/tree/main
and ran the following command:
cargo run --release --features metal -- -i plain -m /Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5 -a llama
Finished `release` profile [optimized] target(s) in 0.91s
Running `target/release/mistralrs-server -i plain -m /Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5 -a llama`
2024-06-17T02:03:04.951259Z INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-06-17T02:03:04.951311Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-06-17T02:03:04.951359Z INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-06-17T02:03:04.951403Z INFO hf_hub: Token file not found "/Users/xxx/.cache/huggingface/token"
2024-06-17T02:03:04.951441Z INFO mistralrs_core::utils::tokens: Could not load token at "/Users/xxx/.cache/huggingface/token", using no HF token.
2024-06-17T02:03:04.951618Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5`
2024-06-17T02:03:05.333603Z INFO mistralrs_core::pipeline::normal: Loading `"tokenizer.json"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/tokenizer.json"`
2024-06-17T02:03:05.333636Z INFO mistralrs_core::pipeline::normal: Loading `config.json` at `/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5`
2024-06-17T02:03:05.618104Z INFO mistralrs_core::pipeline::normal: Loading `"config.json"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/config.json"`
2024-06-17T02:03:06.208592Z INFO mistralrs_core::pipeline::paths: Loading `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00005-of-00007.safetensors"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00005-of-00007.safetensors"`
2024-06-17T02:03:06.488109Z INFO mistralrs_core::pipeline::paths: Loading `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00001-of-00007.safetensors"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00001-of-00007.safetensors"`
2024-06-17T02:03:06.754389Z INFO mistralrs_core::pipeline::paths: Loading `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00004-of-00007.safetensors"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00004-of-00007.safetensors"`
2024-06-17T02:03:07.037357Z INFO mistralrs_core::pipeline::paths: Loading `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00007-of-00007.safetensors"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00007-of-00007.safetensors"`
2024-06-17T02:03:07.329491Z INFO mistralrs_core::pipeline::paths: Loading `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00002-of-00007.safetensors"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00002-of-00007.safetensors"`
2024-06-17T02:03:07.602342Z INFO mistralrs_core::pipeline::paths: Loading `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00006-of-00007.safetensors"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00006-of-00007.safetensors"`
2024-06-17T02:03:07.884465Z INFO mistralrs_core::pipeline::paths: Loading `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00003-of-00007.safetensors"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/model-00003-of-00007.safetensors"`
2024-06-17T02:03:08.742408Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5`
2024-06-17T02:03:09.039340Z INFO mistralrs_core::pipeline::normal: Loading `"tokenizer_config.json"` locally at `"/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5/tokenizer_config.json"`
2024-06-17T02:03:09.040618Z INFO mistralrs_core::utils::normal: DType selected is F32.
2024-06-17T02:03:09.040724Z INFO mistralrs_core::pipeline::normal: Loading model `/Users/xxx/Documents/mistral.rs/models/MiniCPM-Llama3-V-2_5` on Metal(MetalDevice(DeviceId(1)))...
2024-06-17T02:03:09.041088Z INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_size: 4096, intermediate_size: 14336, vocab_size: 128256, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 8, use_flash_attn: false, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 8192 }
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 47/47 [00:04<00:00, 4.86it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:05<00:00, 6.17it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53/53 [00:09<00:00, 7.90it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456/456 [00:12<00:00, 143.62it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 47/47 [00:12<00:00, 5.93it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53/53 [00:10<00:00, 4.46it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53/53 [00:09<00:00, 12.17it/s]
Error: cannot find tensor model.embed_tokens.weight
So as you can see, it ends with an error, and I'm not sure what this model.embed_tokens.weight is?
Now, about the curl: here is the command I used for the build:
cargo build --release --features metal
Then here is the command I used to start the server:
./mistralrs_server --port 1234 gguf -t /Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/ -m /Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF -f Meta-Llama-3-8B-Instruct-Q8_0.gguf
2024-06-17T02:12:33.837370Z INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-06-17T02:12:33.837438Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-06-17T02:12:33.837475Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-06-17T02:12:33.837534Z INFO hf_hub: Token file not found "/Users/xxx/.cache/huggingface/token"
2024-06-17T02:12:33.837588Z INFO mistralrs_core::utils::tokens: Could not load token at "/Users/xxx/.cache/huggingface/token", using no HF token.
2024-06-17T02:12:33.837786Z INFO mistralrs_core::pipeline::gguf: Loading `tokenizer_config.json` at `/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/` because no chat template file was specified.
2024-06-17T02:12:34.203812Z INFO mistralrs_core::pipeline::gguf: Loading `"tokenizer_config.json"` locally at `"/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/tokenizer_config.json"`
2024-06-17T02:12:34.204364Z INFO hf_hub: Token file not found "/Users/xxx/.cache/huggingface/token"
2024-06-17T02:12:34.204423Z INFO mistralrs_core::utils::tokens: Could not load token at "/Users/xxx/.cache/huggingface/token", using no HF token.
2024-06-17T02:12:34.499030Z INFO mistralrs_core::pipeline::paths: Loading `"Meta-Llama-3-8B-Instruct-Q8_0.gguf"` locally at `"/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q8_0.gguf"`
2024-06-17T02:12:35.451508Z INFO mistralrs_core::pipeline::gguf: Loading `tokenizer.json` at `/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/`
2024-06-17T02:12:36.719412Z INFO mistralrs_core::pipeline::gguf: Loading `"tokenizer.json"` locally at `"/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/tokenizer.json"`
2024-06-17T02:12:36.719565Z INFO mistralrs_core::pipeline::gguf: Loading model `/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/` on Metal(MetalDevice(DeviceId(1)))...
2024-06-17T02:12:37.249332Z INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 7
general.name: Meta-Llama-3-8B-Instruct-imatrix
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 8192
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 500000
llama.vocab_size: 128256
2024-06-17T02:12:43.418126Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|end_of_text|>", "<|eot_id|>", unk_tok = `None`
2024-06-17T02:12:43.428386Z INFO mistralrs_server: Model loaded.
2024-06-17T02:12:43.429149Z INFO mistralrs_server: Serving on http://0.0.0.0:1234.
and then I did my curl command; I ran the same command 3 times to show you the time it takes:
{"id":"0","choices":[{"finish_reason":"stop","index":0,"message":{"content":"Hello there! It's lovely to meet you! I'm Mistral.rs, a language model AI assistant, here to help and assist you to the best of my abilities. What's on your mind? Do you have any questions, topics you'd like to discuss, or perhaps some tasks you'd like me to help you with? I'm all ears (or rather, all text)!","role":"assistant"},"logprobs":null}],"created":1718336660,"model":"/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/","system_fingerprint":"local","object":"chat.completion","usage":{"completion_tokens":79,"prompt_tokens":29,"total_tokens":108,"avg_tok_per_sec":4.654972,"avg_prompt_tok_per_sec":2.3949127,"avg_compl_tok_per_sec":7.12225,"total_time_sec":23.201,"total_prompt_time_sec":12.109,"total_completion_time_sec":11.092}}
{"id":"1","choices":[{"finish_reason":"stop","index":0,"message":{"content":"Hello there! It's great to meet you! I'm Mistral.rs, your friendly AI assistant. How can I help you today? Do you have a question, or perhaps a topic you'd like to discuss?","role":"assistant"},"logprobs":null}],"created":1718336692,"model":"/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/","system_fingerprint":"local","object":"chat.completion","usage":{"completion_tokens":45,"prompt_tokens":29,"total_tokens":74,"avg_tok_per_sec":6.2153535,"avg_prompt_tok_per_sec":10.079945,"avg_compl_tok_per_sec":4.9839406,"total_time_sec":11.906,"total_prompt_time_sec":2.877,"total_completion_time_sec":9.029}}
{"id":"2","choices":[{"finish_reason":"stop","index":0,"message":{"content":"Hello! Nice to meet you! Is there something I can help you with today? I'm Mistral.rs, your friendly AI assistant.","role":"assistant"},"logprobs":null}],"created":1718336732,"model":"/Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/","system_fingerprint":"local","object":"chat.completion","usage":{"completion_tokens":29,"prompt_tokens":29,"total_tokens":58,"avg_tok_per_sec":3.0154934,"avg_prompt_tok_per_sec":4.3071437,"avg_compl_tok_per_sec":2.3198144,"total_time_sec":19.234,"total_prompt_time_sec":6.733,"total_completion_time_sec":12.501}}
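For reference, the throughput fields in those responses are just tokens divided by the corresponding timing fields. A quick Python check against the first response's `usage` block:

```python
# Recompute the throughput numbers from the first response's `usage`
# block above: avg prompt/completion tok/s are just tokens / seconds.
usage = {
    "completion_tokens": 79, "prompt_tokens": 29,
    "total_prompt_time_sec": 12.109, "total_completion_time_sec": 11.092,
}
prompt_tps = usage["prompt_tokens"] / usage["total_prompt_time_sec"]
compl_tps = usage["completion_tokens"] / usage["total_completion_time_sec"]
print(round(prompt_tps, 2), round(compl_tps, 2))  # 2.39 7.12
```

The prompt and completion phases are timed separately, which helps tell prompt-processing (including warm-up) cost apart from raw generation speed.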
Completion seems quite slow, but maybe it's normal for a computer like mine; I would just like to have your opinion on this.
Thanks a lot!
Hi @nomopo45!
I gave interactive chat a try using your command; after seeing your commands and some others, I feel it's not possible to have an interactive chat with a GGUF file, am I right?
No, you can:
./mistralrs_server -i gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
Then I downloaded this repo: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/tree/main
The CPM model architecture (non-GGUF) is not compatible with the "plain" llama architecture; it would be necessary to add a `mini-cpm` architecture to the `vision-plain` category.
Completion seems quite slow, but maybe it's normal for a computer like mine; I would just like to have your opinion on this.
I have seen similar results on similar hardware. This performance is not great, though; could you please try building and rerunning on the CPU?
Closing as completed, please feel free to reopen!
Hello,
Thanks for the project, it looks really nice! I'm new to this world and I'm struggling to do what I want.
I have a MacBook M1 Pro with 16 GB. I managed to install it and make it run, but some things I would like to do don't work.
First I used this command:
From my understanding, `-t` refers to `tokenizer_config.json`, `-m` would be the path of the model, and `-f` the actual GGUF model file name?
Once the command runs, it says:
So I tried a curl, but it takes super long to get a reply, like ~10 sec:
Is it normal for it to be so slow ?
Now I tried to have an interactive chat, but all the commands I used failed. Could someone guide me on how to get an interactive chat?
Could someone explain to me what LoRA and X-LoRA are, what they do, and why we should use them?
I'm also using LM Studio, but I wanted to give this project a try since the tokens/s seem to be a strong point of this project.
Anyway, thanks in advance for any replies and help!
PS: here is what's inside the /Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/