
Streamed inference not as smooth (fast?) as with e.g. Ollama - Llama 3.1 #630

ChristianWeyer opened this issue 1 month ago (status: open)

ChristianWeyer commented 1 month ago

Describe the bug

Have a look :-)

https://github.com/user-attachments/assets/321dbb21-2403-4330-9ce1-091902298888

Latest commit or version

0.22, on a MacBook Pro M3 Max

EricLBuehler commented 1 month ago

Hi @ChristianWeyer if you could please try to gather some T/s metrics, that'd be amazing for a quantitative comparison!

ChristianWeyer commented 1 month ago

Sure!

Ollama has --verbose:

❯ ollama run llama3.1:8b-instruct-fp16 --verbose
>>> tell me a joke
Here's one:

What do you call a fake noodle?

An impasta.

total duration:       1.29921225s
load duration:        34.187542ms
prompt eval count:    15 token(s)
prompt eval duration: 483.086ms
prompt eval rate:     31.05 tokens/s
eval count:           18 token(s)
eval duration:        781.205ms
eval rate:            23.04 tokens/s
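
(For reference, the eval rate Ollama reports is simply eval count divided by eval duration: 18 tokens / 0.781205 s ≈ 23.04 tokens/s, i.e. decode-only throughput excluding prompt processing.)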

Is there anything similar for mistral.rs, @EricLBuehler?

EricLBuehler commented 1 month ago

Yes, mistral.rs has a --throughput flag, placed before the model selector (e.g. plain). It can also be used with the server.
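
For example, a server run along these lines should log throughput (a sketch only: the --port value is illustrative, use whatever port you normally serve on):

cargo run --release --features metal -- --port 1234 --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama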

ChristianWeyer commented 1 month ago

Is there a trick to see the throughput values in interactive mode? Or does it not work with -i?

❯ cargo run --release --features metal -- -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama
    Finished `release` profile [optimized] target(s) in 0.43s
     Running `target/release/mistralrs-server -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama`
2024-07-25T18:40:01.534562Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-07-25T18:40:01.534616Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-07-25T18:40:01.534652Z  INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-07-25T18:40:01.535339Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-07-25T18:40:01.535540Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-07-25T18:40:02.049554Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00004.safetensors", "model-00002-of-00004.safetensors", "model-00003-of-00004.safetensors", "model-00004-of-00004.safetensors"]
2024-07-25T18:40:02.205875Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-07-25T18:40:02.783612Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-07-25T18:40:02.786192Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2024-07-25T18:40:02.786199Z  INFO mistralrs_core::pipeline::normal: Loading model `meta-llama/Meta-Llama-3.1-8B-Instruct` on metal[4294968875].
2024-07-25T18:40:02.786317Z  INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_size: 4096, intermediate_size: 14336, vocab_size: 128256, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 8, use_flash_attn: false, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 8.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }) }
100%|█████████████████████████████████████████████████████████████| 82/82 [00:07<00:00, 29.71it/s]
100%|██████████████████████████████████████████████████████████| 104/104 [00:00<00:00, 128.57it/s]
100%|███████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 80.46it/s]
100%|█████████████████████████████████████████████████████████████| 5/5 [00:01<00:00, 1787.63it/s]
2024-07-25T18:40:12.803523Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2024-07-25T18:40:13.201142Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2024-07-25T18:40:13.214374Z  INFO mistralrs_server: Model loaded.
2024-07-25T18:40:13.214438Z  INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1 }
> tell me a joke
Here's one:

What do you call a fake noodle?

An impasta.
>
EricLBuehler commented 1 month ago

@ChristianWeyer not at the moment, it is only for the server. I will add that tomorrow, but in the meantime, if you start up an OpenAI-compatible server (perhaps for both Ollama and mistral.rs) we can isolate whether the issue is in model performance or in the streaming implementation.
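
For a rough streaming comparison, a request like the following (a sketch; the port and the exact model name the server expects are assumptions, and it presumes both servers expose the usual OpenAI-style /v1/chat/completions route) can be pointed at each server in turn:

curl -N http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "tell me a joke"}], "stream": true}'

With -N curl disables output buffering, so the SSE chunks print as they arrive and any stalls in the stream are directly visible.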

ChristianWeyer commented 1 month ago

Have you been able to update the code for the interactive mode @EricLBuehler ?

EricLBuehler commented 1 month ago

@ChristianWeyer yes in #655.

ChristianWeyer commented 3 weeks ago

Sorry for the late reply @EricLBuehler.

I just tried to run the latest commit (8cab33bb91096ab38e19565ceee75cdd52f0b02a) with cargo run --release --features metal -- -i plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama

and got an error:

error[E0004]: non-exhaustive patterns: `DType::I32` not covered
   --> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/7ad6494/candle-core/src/sort.rs:145:23
    |
145 |                 match storage.dtype() {
    |                       ^^^^^^^^^^^^^^^ pattern `DType::I32` not covered
    |
note: `DType` defined here
   --> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/7ad6494/candle-core/src/dtype.rs:8:10
    |
8   | pub enum DType {
    |          ^^^^^
...
14  |     I32,
    |     --- not covered
    = note: the matched value is of type `DType`
help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern or an explicit pattern as shown
    |
152 ~                     DType::I64 => "asort_asc_i64",
153 ~                     DType::I32 => todo!(),
    |

error[E0004]: non-exhaustive patterns: `DType::I32` not covered
   --> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/7ad6494/candle-core/src/sort.rs:155:23
    |
155 |                 match storage.dtype() {
    |                       ^^^^^^^^^^^^^^^ pattern `DType::I32` not covered
    |
note: `DType` defined here
   --> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/7ad6494/candle-core/src/dtype.rs:8:10
    |
8   | pub enum DType {
    |          ^^^^^
...
14  |     I32,
    |     --- not covered
    = note: the matched value is of type `DType`
help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern or an explicit pattern as shown
    |
162 ~                     DType::I64 => "asort_desc_i64",
163 ~                     DType::I32 => todo!(),
    |

   Compiling pyo3-macros v0.22.2
   Compiling rust-embed-impl v8.5.0
   Compiling derive_builder v0.20.0
   Compiling esaxx-rs v0.1.10
   Compiling darling v0.11.0
   Compiling utoipa-gen v4.3.0
For more information about this error, try `rustc --explain E0004`.
error: could not compile `candle-core` (lib) due to 2 previous errors
warning: build failed, waiting for other jobs to finish...
EricLBuehler commented 3 weeks ago

@ChristianWeyer sorry for the trouble, I think this should be fixed in #681.

ChristianWeyer commented 3 weeks ago

Sure, no problem @EricLBuehler. Now it compiles.

But at runtime it crashes:

     Running `target/release/mistralrs-server -i plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama`
2024-08-14T11:51:59.682284Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-08-14T11:51:59.682438Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-14T11:51:59.682537Z  INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-08-14T11:51:59.684890Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-14T11:51:59.685635Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-14T11:52:03.626990Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00004.safetensors", "model-00002-of-00004.safetensors", "model-00003-of-00004.safetensors", "model-00004-of-00004.safetensors"]
2024-08-14T11:52:04.101942Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-14T11:52:05.642770Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-14T11:52:05.645004Z  INFO mistralrs_core::pipeline::normal: Loading model `meta-llama/Meta-Llama-3.1-8B-Instruct` on metal[4294968915].
Error: Metal error Error while loading function: "Function bgemm was not found in the library"
xfer commented 3 weeks ago

I have a similar issue (Error: Metal error Error while loading function: "Function bgemm was not found in the library"), but I'm using google/gemma-2-9b-it on a M1 Mac Studio.

ac3xx commented 3 weeks ago

Exact same issue using microsoft/Phi-3-vision-128k-instruct on a M1 Max.

EricLBuehler commented 3 weeks ago

@xfer @ac3xx @ChristianWeyer can you try to rollback to v0.2.4:

git fetch origin tag v0.2.4
git checkout v0.2.4

And then rebuild to see if it works?

ac3xx commented 3 weeks ago

@xfer @ac3xx @ChristianWeyer can you try to rollback to v0.2.4:

git fetch origin tag v0.2.4
git checkout v0.2.4

And then rebuild to see if it works?

I completely forgot to update my comment - I did this earlier and it ran fine. Let me know if you need a bisect/etc.

EricLBuehler commented 3 weeks ago

I completely forgot to update my comment - I did this earlier and it ran fine. Let me know if you need a bisect/etc.

Yeah a bisect would be very helpful!
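
For anyone picking this up, a minimal bisect flow (a sketch, assuming v0.2.4 is the last known-good tag and the runtime crash above is the failure being tested for) would be:

git bisect start
git bisect bad                # current checkout crashes at runtime
git bisect good v0.2.4        # last tag known to work
# rebuild and run the repro at each step, then mark it:
#   git bisect good    or    git bisect bad
git bisect reset              # return to the original checkout when done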

ac3xx commented 3 weeks ago
% cargo run --release --features metal -- -i vision-plain -m microsoft/Phi-3-vision-128k-instruct -a phi3v
   Compiling mistralrs-core v0.2.4 (/Users/jl/Code/mistral.rs/mistralrs-core)
error[E0308]: arguments to this method are incorrect
   --> mistralrs-core/src/pipeline/isq.rs:128:30
    |
128 | ...                   .apply_isq(dtype, &n_quantized, device)
    |                        ^^^^^^^^^        ------------  ------ expected `&AtomicUsize`, found `candle_core::Device`
    |                                         |
    |                                         expected `candle_core::Device`, found `&AtomicUsize`
    |
note: method defined here
   --> /Users/jl/Code/mistral.rs/mistralrs-quant/src/lib.rs:126:8
    |
126 |     fn apply_isq(
    |        ^^^^^^^^^
help: swap these arguments
    |
128 |                             .apply_isq(dtype, device, &n_quantized)
    |                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For more information about this error, try `rustc --explain E0308`.
error: could not compile `mistralrs-core` (lib) due to 1 previous error

@EricLBuehler #683 has broken compilation on master as an FYI.


Yeah a bisect would be very helpful!

After correcting for the red-herring commits (caused by the wrong candle commit), the failure comes from the rewrite of the automatic dtype inference. Specifically, the newer version of try_into_dtype now calls determine_auto_dtype_all, which is missing a case (candle_core::Error::Metal(_)) that is thrown due to the lack of BF16 support. Forcing f16 works fine.

I've opened #685 with the missing error case added, confirmed working without -d f16.
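
To illustrate the shape of the fix (a hedged sketch only: pick_auto_dtype and the probe below are made up for illustration and do not match the real determine_auto_dtype_all signature in mistralrs-core), the idea is that a backend error while probing BF16 should mean "unsupported here, fall back" rather than bubbling up as a hard error:

fn pick_auto_dtype(device: &candle_core::Device) -> candle_core::Result<candle_core::DType> {
    use candle_core::{DType, Error, Tensor};
    // Probe candidate dtypes from most to least preferred.
    for dtype in [DType::BF16, DType::F16, DType::F32] {
        // Run a trivial op in this dtype; on Metal without BF16 support this
        // surfaces as Error::Metal(_), the case that was previously unmatched.
        let probe = Tensor::zeros((2, 2), dtype, device).and_then(|t| t.matmul(&t));
        match probe {
            Ok(_) => return Ok(dtype),
            // Unsupported on this backend: fall through to the next candidate.
            Err(Error::UnsupportedDTypeForOp(_, _)) | Err(Error::Metal(_)) => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(DType::F32)
}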

EricLBuehler commented 3 weeks ago

@xfer @ChristianWeyer I just merged @ac3xx's PR #685, which should fix this issue, along with a fix for the compilation issue on master. So, I think master should be working now, but confirmation from someone with a Metal machine would be great.

xfer commented 3 weeks ago

@EricLBuehler for gemma-2-2b-it and gemma-2-2b it is working fine!

Also sorry for not testing the bisect 😞

ChristianWeyer commented 3 weeks ago

OK, so then here - finally - the stats you requested @EricLBuehler:

Ollama:

total duration:       2.240256875s
load duration:        32.448458ms
prompt eval count:    15 token(s)
prompt eval duration: 560.735ms
prompt eval rate:     26.75 tokens/s
eval count:           37 token(s)
eval duration:        1.646012s
eval rate:            22.48 tokens/s

mistral.rs:

2024-08-15T12:54:06.636383Z  INFO mistralrs_server::interactive_mode: Average T/s: 10.96718959597559
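
(Side by side, that is roughly a 2x gap in decode throughput on the same M3 Max: 22.48 / 10.97 ≈ 2.05.)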

EricLBuehler commented 3 weeks ago

@EricLBuehler for gemma-2-2b-it and gemma-2-2b it is working fine!

Great, glad to hear @xfer! No worries about the bisect.

OK, so then here - finally - the stats you requested @EricLBuehler:

@ChristianWeyer thanks for letting me know. I'll see what optimizations we can make.

ChristianWeyer commented 3 weeks ago

Do you need more help to identify potential performance issues @EricLBuehler?

EricLBuehler commented 3 weeks ago

@ChristianWeyer if you could please paste the output of interactive mode with all the logging during loading, that would be very helpful!

ChristianWeyer commented 3 weeks ago

The latest commit (575286b5d48a569813ce7c35f953f44b1800146d) gives me this error @EricLBuehler:

cargo run --release --features metal -- -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama

error[E0425]: cannot find value `rhs` in this scope
   --> mistralrs-quant/src/utils/ops.rs:306:31
    |
306 |         let original_device = rhs.device();
    |                               ^^^ not found in this scope

error[E0061]: this method takes 2 arguments but 1 argument was supplied
   --> mistralrs-quant/src/utils/ops.rs:308:14
    |
308 |             .apply_op2_no_bwd(&Leftshift(n))?
    |              ^^^^^^^^^^^^^^^^ ------------- an argument of type `&candle_core::Tensor` is missing
    |
note: method defined here
   --> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/2386e4e/candle-core/src/custom_op.rs:162:12
    |
162 |     pub fn apply_op2_no_bwd<C: CustomOp2>(&self, rhs: &Self, c: &C) -> Result<Self> {
    |            ^^^^^^^^^^^^^^^^
help: provide the argument
    |
308 |             .apply_op2_no_bwd(/* &candle_core::Tensor */, &Leftshift(n))?
    |                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   Compiling mistralrs-vision v0.2.5 (/Users/christianweyer/Sources/mistral.rs/mistralrs-vision)
Some errors have detailed explanations: E0061, E0425.
For more information about an error, try `rustc --explain E0061`.
error: could not compile `mistralrs-quant` (lib) due to 2 previous errors
warning: build failed, waiting for other jobs to finish...
EricLBuehler commented 3 weeks ago

@ChristianWeyer thanks for letting me know, 70c647c should fix this now.

ChristianWeyer commented 2 weeks ago

@ChristianWeyer if you could please paste the output of interactive mode with all the logging during loading, that would be very helpful!

Voila:

❯ cargo run --release --features metal -- -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama
    Finished `release` profile [optimized] target(s) in 0.60s
     Running `target/release/mistralrs-server -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama`
2024-08-19T14:10:52.254964Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-08-19T14:10:52.255064Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-19T14:10:52.255104Z  INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-08-19T14:10:52.255541Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-19T14:10:52.255857Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-19T14:11:22.505704Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00004.safetensors", "model-00002-of-00004.safetensors", "model-00003-of-00004.safetensors", "model-00004-of-00004.safetensors"]
2024-08-19T14:11:22.843371Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-19T14:11:23.354823Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-19T14:11:23.357734Z  INFO mistralrs_core::pipeline::normal: Loading model `meta-llama/Meta-Llama-3.1-8B-Instruct` on metal[4294968463].
2024-08-19T14:11:23.366631Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2024-08-19T14:11:23.366839Z  INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_size: 4096, intermediate_size: 14336, vocab_size: 128256, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 8, use_flash_attn: false, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 8.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }), quantization_config: None }
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 82/82 [00:06<00:00, 20.91it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████| 104/104 [00:00<00:00, 135.13it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 107.32it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 52.92it/s]
2024-08-19T14:11:31.688832Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2024-08-19T14:11:31.698495Z  INFO mistralrs_server: Model loaded.
2024-08-19T14:11:31.698591Z  INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1 }
>
ChristianWeyer commented 2 weeks ago

Did that help @EricLBuehler?

EricLBuehler commented 1 week ago

@ChristianWeyer thanks, yes that did help. I'm concerned that the Metal ordinal seems to be an unsigned integer overflow: metal[4294968463], so maybe it's using the CPU somehow. Can you please confirm the GPU is being utilized?

Sorry for the late reply.
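
(For what it's worth, both logged ordinals sit just past 2^32 = 4294967296: 4294968463 − 2^32 = 1167, and the earlier run's 4294968875 − 2^32 = 1579, which does look like a wrapped or sign-extended device id rather than device 0. On macOS, Activity Monitor's GPU History window, or sudo powermetrics --samplers gpu_power, will show whether the GPU is actually busy during generation.)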

ChristianWeyer commented 5 days ago

(On holidays… back at the weekend 🌴)

ChristianWeyer commented 1 day ago

@ChristianWeyer thanks, yes that did help. I'm concerned that the Metal ordinal seems to be an unsigned integer overflow: metal[4294968463], so maybe it's using the CPU somehow. Can you please confirm the GPU is being utilized?

Sorry for the late reply.

Tried with commit cccdd27f549f4a6f12daf4ed4764861551449fa0 - and ran into this error:

   Compiling mistralrs-server v0.3.0 (/Users/christianweyer/Sources/mistral.rs/mistralrs-server)
error[E0658]: use of unstable library feature 'absolute_path'
  --> mistralrs-server/src/util.rs:11:34
   |
11 |         url::Url::from_file_path(std::path::absolute(url_unparsed)?)
   |                                  ^^^^^^^^^^^^^^^^^^^
   |
   = note: see issue #92750 <https://github.com/rust-lang/rust/issues/92750> for more information

For more information about this error, try `rustc --explain E0658`.
error: could not compile `mistralrs-server` (bin "mistralrs-server") due to 1 previous error
EricLBuehler commented 24 minutes ago

@ChristianWeyer as of v0.3.0, our MSRV is now 1.79. This error indicates that you have an older toolchain installed; can you please run rustup update?
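
(For example: run rustup update stable, then check that rustc --version reports 1.79.0 or newer before rebuilding.)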