Open ChristianWeyer opened 1 month ago

Describe the bug
Have a look :-)
https://github.com/user-attachments/assets/321dbb21-2403-4330-9ce1-091902298888

Latest commit or version
0.22 (MBP M3 Max)
Hi @ChristianWeyer if you could please try to gather some T/s metrics, that'd be amazing for a quantitative comparison!
Sure!
Ollama has --verbose:
❯ ollama run llama3.1:8b-instruct-fp16 --verbose
>>> tell me a joke
Here's one:
What do you call a fake noodle?
An impasta.
total duration: 1.29921225s
load duration: 34.187542ms
prompt eval count: 15 token(s)
prompt eval duration: 483.086ms
prompt eval rate: 31.05 tokens/s
eval count: 18 token(s)
eval duration: 781.205ms
eval rate: 23.04 tokens/s
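(For reference, the two rates above are just token counts divided by durations; a minimal Rust sketch reproducing Ollama's numbers:)

```rust
// How the rates follow from the counts and durations above.
fn rate(tokens: u32, secs: f64) -> f64 {
    f64::from(tokens) / secs
}

fn main() {
    // 15 tokens / 0.483086 s ≈ 31.05 tokens/s
    println!("prompt eval rate: {:.2} tokens/s", rate(15, 0.483086));
    // 18 tokens / 0.781205 s ≈ 23.04 tokens/s
    println!("eval rate: {:.2} tokens/s", rate(18, 0.781205));
}
```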
Is there anything similar for mistral.rs, @EricLBuehler?
Yes, mistral.rs has a --throughput flag, passed before the model selector (plain). It can also be used with the server.
Is there a trick to see the throughput values in interactive mode? Or does it not work with -i?
❯ cargo run --release --features metal -- -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama
Finished `release` profile [optimized] target(s) in 0.43s
Running `target/release/mistralrs-server -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama`
2024-07-25T18:40:01.534562Z INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-07-25T18:40:01.534616Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-07-25T18:40:01.534652Z INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-07-25T18:40:01.535339Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-07-25T18:40:01.535540Z INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-07-25T18:40:02.049554Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00004.safetensors", "model-00002-of-00004.safetensors", "model-00003-of-00004.safetensors", "model-00004-of-00004.safetensors"]
2024-07-25T18:40:02.205875Z INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-07-25T18:40:02.783612Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-07-25T18:40:02.786192Z INFO mistralrs_core::utils::normal: DType selected is F16.
2024-07-25T18:40:02.786199Z INFO mistralrs_core::pipeline::normal: Loading model `meta-llama/Meta-Llama-3.1-8B-Instruct` on metal[4294968875].
2024-07-25T18:40:02.786317Z INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_size: 4096, intermediate_size: 14336, vocab_size: 128256, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 8, use_flash_attn: false, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 8.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }) }
100%|█████████████████████████████████████████████████████████████| 82/82 [00:07<00:00, 29.71it/s]
100%|██████████████████████████████████████████████████████████| 104/104 [00:00<00:00, 128.57it/s]
100%|███████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 80.46it/s]
100%|█████████████████████████████████████████████████████████████| 5/5 [00:01<00:00, 1787.63it/s]
2024-07-25T18:40:12.803523Z INFO mistralrs_core::utils::normal: DType selected is F16.
2024-07-25T18:40:13.201142Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2024-07-25T18:40:13.214374Z INFO mistralrs_server: Model loaded.
2024-07-25T18:40:13.214438Z INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1 }
> tell me a joke
Here's one:
What do you call a fake noodle?
An impasta.
>
@ChristianWeyer not at the moment; it is only for the server. I will add that tomorrow, but in the meantime, if you start up an OpenAI-compatible server (perhaps in both tools), we can isolate whether the issue is in model performance or in the streaming implementation.
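(One rough way to do that comparison is a single non-streaming request against each OpenAI-compatible endpoint, timing completion tokens per second. A minimal sketch; the port, path, and model name are assumptions, and it needs reqwest with the "blocking" and "json" features plus serde_json. Note the elapsed time also includes prompt processing, so this is only a coarse comparison:)

```rust
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumed endpoint; point this at the mistral.rs or Ollama OAI server.
    let url = "http://localhost:1234/v1/chat/completions";
    let body = serde_json::json!({
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "tell me a joke"}],
        "stream": false
    });

    let start = Instant::now();
    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post(url)
        .json(&body)
        .send()?
        .json()?;
    let elapsed = start.elapsed().as_secs_f64();

    // completion_tokens is part of the standard OpenAI usage object.
    let tokens = resp["usage"]["completion_tokens"].as_f64().unwrap_or(0.0);
    println!("{tokens} tokens in {elapsed:.2}s = {:.2} T/s", tokens / elapsed);
    Ok(())
}
```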
Have you been able to update the code for the interactive mode, @EricLBuehler?
@ChristianWeyer yes in #655.
Sorry for the late reply @EricLBuehler.
I just tried to run the latest commit (8cab33bb91096ab38e19565ceee75cdd52f0b02a) with
cargo run --release --features metal -- -i plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama
and got an error:
error[E0004]: non-exhaustive patterns: `DType::I32` not covered
--> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/7ad6494/candle-core/src/sort.rs:145:23
|
145 | match storage.dtype() {
| ^^^^^^^^^^^^^^^ pattern `DType::I32` not covered
|
note: `DType` defined here
--> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/7ad6494/candle-core/src/dtype.rs:8:10
|
8 | pub enum DType {
| ^^^^^
...
14 | I32,
| --- not covered
= note: the matched value is of type `DType`
help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern or an explicit pattern as shown
|
152 ~ DType::I64 => "asort_asc_i64",
153 ~ DType::I32 => todo!(),
|
error[E0004]: non-exhaustive patterns: `DType::I32` not covered
--> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/7ad6494/candle-core/src/sort.rs:155:23
|
155 | match storage.dtype() {
| ^^^^^^^^^^^^^^^ pattern `DType::I32` not covered
|
note: `DType` defined here
--> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/7ad6494/candle-core/src/dtype.rs:8:10
|
8 | pub enum DType {
| ^^^^^
...
14 | I32,
| --- not covered
= note: the matched value is of type `DType`
help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern or an explicit pattern as shown
|
162 ~ DType::I64 => "asort_desc_i64",
163 ~ DType::I32 => todo!(),
|
Compiling pyo3-macros v0.22.2
Compiling rust-embed-impl v8.5.0
Compiling derive_builder v0.20.0
Compiling esaxx-rs v0.1.10
Compiling darling v0.11.0
Compiling utoipa-gen v4.3.0
For more information about this error, try `rustc --explain E0004`.
error: could not compile `candle-core` (lib) due to 2 previous errors
warning: build failed, waiting for other jobs to finish...
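(For context, E0004 just means a match does not cover every variant of an enum; a toy reproduction, not the actual candle-core code:)

```rust
#[allow(dead_code)]
enum DType {
    F16,
    I64,
    I32, // the newly added variant that broke the build
}

fn kernel_name(dt: DType) -> &'static str {
    match dt {
        DType::F16 => "asort_asc_f16",
        DType::I64 => "asort_asc_i64",
        // Deleting this arm reproduces error[E0004]: non-exhaustive
        // patterns: `DType::I32` not covered. A `_ => ...` wildcard
        // would also satisfy the exhaustiveness check.
        DType::I32 => "asort_asc_i32",
    }
}

fn main() {
    println!("{}", kernel_name(DType::I32));
}
```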
@ChristianWeyer sorry for the trouble, I think this should be fixed in #681.
Sure, no problem @EricLBuehler. Now it compiles.
But at runtime it crashes:
Running `target/release/mistralrs-server -i plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama`
2024-08-14T11:51:59.682284Z INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-08-14T11:51:59.682438Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-14T11:51:59.682537Z INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-08-14T11:51:59.684890Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-14T11:51:59.685635Z INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-14T11:52:03.626990Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00004.safetensors", "model-00002-of-00004.safetensors", "model-00003-of-00004.safetensors", "model-00004-of-00004.safetensors"]
2024-08-14T11:52:04.101942Z INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-14T11:52:05.642770Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-14T11:52:05.645004Z INFO mistralrs_core::pipeline::normal: Loading model `meta-llama/Meta-Llama-3.1-8B-Instruct` on metal[4294968915].
Error: Metal error Error while loading function: "Function bgemm was not found in the library"
I have a similar issue (Error: Metal error Error while loading function: "Function bgemm was not found in the library"), but I'm using google/gemma-2-9b-it on an M1 Mac Studio.
Exact same issue using microsoft/Phi-3-vision-128k-instruct on an M1 Max.
@xfer @ac3xx @ChristianWeyer can you try rolling back to v0.2.4:
git fetch origin tag v0.2.4
git checkout v0.2.4
And then rebuild to see if it works?
I completely forgot to update my comment - I did this earlier and it ran fine. Let me know if you need a bisect/etc.
Yeah a bisect would be very helpful!
% cargo run --release --features metal -- -i vision-plain -m microsoft/Phi-3-vision-128k-instruct -a phi3v
Compiling mistralrs-core v0.2.4 (/Users/jl/Code/mistral.rs/mistralrs-core)
error[E0308]: arguments to this method are incorrect
--> mistralrs-core/src/pipeline/isq.rs:128:30
|
128 | ... .apply_isq(dtype, &n_quantized, device)
| ^^^^^^^^^ ------------ ------ expected `&AtomicUsize`, found `candle_core::Device`
| |
| expected `candle_core::Device`, found `&AtomicUsize`
|
note: method defined here
--> /Users/jl/Code/mistral.rs/mistralrs-quant/src/lib.rs:126:8
|
126 | fn apply_isq(
| ^^^^^^^^^
help: swap these arguments
|
128 | .apply_isq(dtype, device, &n_quantized)
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For more information about this error, try `rustc --explain E0308`.
error: could not compile `mistralrs-core` (lib) due to 1 previous error
@EricLBuehler #683 has broken compilation on master as an FYI.
Correcting for the red herring commits (caused by the wrong candle commit), it's caused by the rewrite of the automatic dtype inference. Specifically, that change led to the newer version of try_into_dtype calling determine_auto_dtype_all, which is missing a case (candle_core::Error::Metal(_)) thrown due to the lack of BF16 support. Forcing F16 works fine.
I've opened #685 with the missing error case added, confirmed working without -d f16.
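(A standalone sketch of that fallback logic, using stand-in types and a hypothetical probe_bf16 helper rather than the real candle-core API:)

```rust
// Stand-ins for candle-core types; the real fix in #685 adds the
// candle_core::Error::Metal(_) case to the dtype-probing match.
#[allow(dead_code)]
enum BackendError {
    Unsupported,   // dtype reported as unsupported by the backend
    Metal(String), // Metal kernel missing, e.g. no BF16 bgemm
    Other(String), // anything else is a genuine failure
}

#[derive(Debug)]
enum DType {
    BF16,
    F16,
}

// Hypothetical probe: try a tiny BF16 op on the device.
fn probe_bf16() -> Result<(), BackendError> {
    Err(BackendError::Metal("Function bgemm was not found".to_string()))
}

fn auto_dtype() -> Result<DType, String> {
    match probe_bf16() {
        Ok(()) => Ok(DType::BF16),
        // Both cases mean "BF16 unavailable here": fall back to F16.
        Err(BackendError::Unsupported) | Err(BackendError::Metal(_)) => Ok(DType::F16),
        Err(BackendError::Other(msg)) => Err(msg),
    }
}

fn main() {
    println!("DType selected is {:?}.", auto_dtype().unwrap());
}
```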
@xfer @ChristianWeyer I just merged @ac3xx's PR #685 which should fix this issue. I also merged #685 which should fix the compilation issue. So, I think master should be working now, but confirmation from someone with a Metal machine would be great.
@EricLBuehler for gemma-2-2b-it and gemma-2-2b it is working fine!
Also sorry for not testing the bisect 😞
OK, so then here - finally - the stats you requested @EricLBuehler:
Ollama:
total duration: 2.240256875s
load duration: 32.448458ms
prompt eval count: 15 token(s)
prompt eval duration: 560.735ms
prompt eval rate: 26.75 tokens/s
eval count: 37 token(s)
eval duration: 1.646012s
eval rate: 22.48 tokens/s
mistral.rs:
2024-08-15T12:54:06.636383Z INFO mistralrs_server::interactive_mode: Average T/s: 10.96718959597559
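(Putting the eval rate and average T/s side by side, mistral.rs is decoding at roughly half of Ollama's rate on this machine; a trivial check:)

```rust
fn main() {
    let ollama_eval = 22.48; // Ollama eval rate, tokens/s
    let mistral_rs = 10.967; // mistral.rs average T/s (interactive mode)
    // Roughly a 2x decode-throughput gap.
    println!("ratio: {:.2}x", ollama_eval / mistral_rs);
}
```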
Great, glad to hear @xfer! No worries about the bisect.
@ChristianWeyer thanks for letting me know. I'll see what optimizations we can make.
Do you need more help to identify potential performance issues @EricLBuehler?
@ChristianWeyer if you could please paste the output of interactive mode with all the logging during loading, that would be very helpful!
The latest commit (575286b5d48a569813ce7c35f953f44b1800146d) gives me this error @EricLBuehler :
cargo run --release --features metal -- -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama
error[E0425]: cannot find value `rhs` in this scope
--> mistralrs-quant/src/utils/ops.rs:306:31
|
306 | let original_device = rhs.device();
| ^^^ not found in this scope
error[E0061]: this method takes 2 arguments but 1 argument was supplied
--> mistralrs-quant/src/utils/ops.rs:308:14
|
308 | .apply_op2_no_bwd(&Leftshift(n))?
| ^^^^^^^^^^^^^^^^ ------------- an argument of type `&candle_core::Tensor` is missing
|
note: method defined here
--> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/2386e4e/candle-core/src/custom_op.rs:162:12
|
162 | pub fn apply_op2_no_bwd<C: CustomOp2>(&self, rhs: &Self, c: &C) -> Result<Self> {
| ^^^^^^^^^^^^^^^^
help: provide the argument
|
308 | .apply_op2_no_bwd(/* &candle_core::Tensor */, &Leftshift(n))?
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compiling mistralrs-vision v0.2.5 (/Users/christianweyer/Sources/mistral.rs/mistralrs-vision)
Some errors have detailed explanations: E0061, E0425.
For more information about an error, try `rustc --explain E0061`.
error: could not compile `mistralrs-quant` (lib) due to 2 previous errors
warning: build failed, waiting for other jobs to finish...
@ChristianWeyer thanks for letting me know, 70c647c should fix this now.
Voila:
❯ cargo run --release --features metal -- -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama
Finished `release` profile [optimized] target(s) in 0.60s
Running `target/release/mistralrs-server -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama`
2024-08-19T14:10:52.254964Z INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-08-19T14:10:52.255064Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-19T14:10:52.255104Z INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-08-19T14:10:52.255541Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-19T14:10:52.255857Z INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-19T14:11:22.505704Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00004.safetensors", "model-00002-of-00004.safetensors", "model-00003-of-00004.safetensors", "model-00004-of-00004.safetensors"]
2024-08-19T14:11:22.843371Z INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-19T14:11:23.354823Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-19T14:11:23.357734Z INFO mistralrs_core::pipeline::normal: Loading model `meta-llama/Meta-Llama-3.1-8B-Instruct` on metal[4294968463].
2024-08-19T14:11:23.366631Z INFO mistralrs_core::utils::normal: DType selected is F16.
2024-08-19T14:11:23.366839Z INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_size: 4096, intermediate_size: 14336, vocab_size: 128256, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 8, use_flash_attn: false, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 8.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }), quantization_config: None }
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 82/82 [00:06<00:00, 20.91it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████| 104/104 [00:00<00:00, 135.13it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 107.32it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 52.92it/s]
2024-08-19T14:11:31.688832Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2024-08-19T14:11:31.698495Z INFO mistralrs_server: Model loaded.
2024-08-19T14:11:31.698591Z INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1 }
>
Did that help @EricLBuehler?
@ChristianWeyer thanks, yes that did help. I'm concerned that the Metal ordinal seems to be an unsigned integer overflow: metal[4294968463], so maybe it's using the CPU somehow. Can you please confirm the GPU is being utilized?
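(For what it's worth, both ordinals logged in this thread sit just past 2^32, which is consistent with the overflow theory; a quick check. This doesn't identify the cause, it just confirms the printed values are 2^32 plus a small remainder rather than plausible raw device indices:)

```rust
fn main() {
    // Both logged ordinals are just past 2^32 (u32::MAX + 1), which is
    // what a wrapped 32-bit value widened to 64 bits would look like.
    for ordinal in [4294968875u64, 4294968463u64] {
        println!("{ordinal} = 2^32 + {}", ordinal - (1u64 << 32));
    }
}
```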
Sorry for the late reply.
(On holidays… back at the weekend 🌴)
Tried with commit cccdd27f549f4a6f12daf4ed4764861551449fa0 - and ran into this error:
Compiling mistralrs-server v0.3.0 (/Users/christianweyer/Sources/mistral.rs/mistralrs-server)
error[E0658]: use of unstable library feature 'absolute_path'
--> mistralrs-server/src/util.rs:11:34
|
11 | url::Url::from_file_path(std::path::absolute(url_unparsed)?)
| ^^^^^^^^^^^^^^^^^^^
|
= note: see issue #92750 <https://github.com/rust-lang/rust/issues/92750> for more information
For more information about this error, try `rustc --explain E0658`.
error: could not compile `mistralrs-server` (bin "mistralrs-server") due to 1 previous error
@ChristianWeyer as of v0.3.0, our MSRV is now 1.79. This error indicates that you have an older toolchain installed; can you please run rustup update?