Memory leak when using local GGUF model with mistral.rs in Rust

Describe the bug

When using a locally loaded GGUF model with mistral.rs in Rust applications, the memory allocated for the model is not being released properly after the model object goes out of scope. This appears to be a memory leak that could lead to increased memory usage over time in long-running applications.

Latest commit or version

v0.2.5

@solaoi thanks for the issue! This is probably because we are launching a separate thread where the Engine is run.

I just merged #735, which exposes and uses a termination request to gracefully drop the Engine. Can you please confirm this works now on master?
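For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of a termination request wired into `Drop`. It is not the code from #735: `Runner`, `Msg`, `submit`, and `shutdown` are hypothetical names, and the sketch uses only the standard library.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

// Hypothetical message type; not the mistral.rs request enum.
enum Msg {
    Work(u32),
    Terminate,
}

// Hypothetical stand-in for an object that owns a background engine thread.
struct Runner {
    tx: Sender<Msg>,
    handle: Option<JoinHandle<u32>>,
}

impl Runner {
    fn new() -> Self {
        let (tx, rx) = channel();
        let handle = thread::spawn(move || {
            // Pretend this buffer is the model state owned by the engine thread.
            let _state = vec![0u8; 1024];
            let mut processed = 0;
            for msg in rx {
                match msg {
                    Msg::Work(n) => processed += n,
                    Msg::Terminate => break,
                }
            }
            processed // `_state` is freed when the closure returns
        });
        Runner { tx, handle: Some(handle) }
    }

    fn submit(&self, n: u32) {
        let _ = self.tx.send(Msg::Work(n));
    }

    // Ask the thread to stop and wait for it; returns the processed total.
    // Returns None if the thread was already shut down.
    fn shutdown(&mut self) -> Option<u32> {
        let handle = self.handle.take()?;
        let _ = self.tx.send(Msg::Terminate);
        handle.join().ok()
    }
}

impl Drop for Runner {
    fn drop(&mut self) {
        // Does nothing if `shutdown` already ran.
        let _ = self.shutdown();
    }
}

fn main() {
    let mut runner = Runner::new();
    runner.submit(2);
    runner.submit(3);
    println!("processed = {:?}", runner.shutdown());
}
```

Joining in `drop` means the owner blocks until the worker has finished and released its state, which is the trade-off for not leaving a detached thread behind.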

I might be off the mark here (just chiming in as I've not attempted to reproduce). If it's a zombie process thing when mistral is run in a container as PID1, there's a crate called pid1 that you could probably use to avoid that.

You could also confirm this with docker run --rm --init (or in compose.yaml via init: true). Without the init feature, the same problem can be resolved by using tini as the entrypoint (or the pid1 crate).

@EricLBuehler Thank you for the update and for merging #735 with the Drop trait implementation. Unfortunately, I'm still encountering an error when trying to run the latest master. Here's the error message I'm getting:

error[E0004]: non-exhaustive patterns: `quantized::GgmlDType::BF16` not covered
   --> /Users/solaoi/.cargo/git/checkouts/candle-c6a149c3b35a488f/f706ef2/candle-core/src/quantized/metal.rs:49:15
    |
49  |         match self.dtype {
    |               ^^^^^^^^^^ pattern `quantized::GgmlDType::BF16` not covered
    |
note: `quantized::GgmlDType` defined here
   --> /Users/solaoi/.cargo/git/checkouts/candle-c6a149c3b35a488f/f706ef2/candle-core/src/quantized/mod.rs:145:10
    |
145 | pub enum GgmlDType {
    |          ^^^^^^^^^
...
148 |     BF16,
    |     ---- not covered
    = note: the matched value is of type `quantized::GgmlDType`
help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern or an explicit pattern as shown
    |
105 ~             },
106 +             quantized::GgmlDType::BF16 => todo!()
    |

I appreciate the implementation of the Drop trait. Since I'm using the Metal backend, I'm wondering if this error might be related to the recent PR for Metal support: https://github.com/EricLBuehler/mistral.rs/pull/719

Could this be the cause of the issue I'm experiencing?
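As background on the compiler error above: E0004 is reported when a `match` has no arm for one of the enum's variants, which typically happens when a variant is added to an enum but an existing `match` is not updated. The sketch below reproduces the shape of the fix with a hypothetical `DType` enum; it is not candle's `GgmlDType` or its Metal code.

```rust
// Hypothetical enum standing in for a dtype enum that gained a `BF16` variant.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DType {
    F32,
    F16,
    BF16,
}

// No wildcard arm here, so the compiler reports E0004 as soon as a
// variant exists that this match does not handle. Removing the `BF16`
// arm below would reproduce that kind of error.
fn size_in_bytes(dtype: DType) -> usize {
    match dtype {
        DType::F32 => 4,
        DType::F16 => 2,
        DType::BF16 => 2,
    }
}

fn main() {
    for d in [DType::F32, DType::F16, DType::BF16] {
        println!("{:?}: {} bytes", d, size_in_bytes(d));
    }
}
```

Avoiding a `_` wildcard is a deliberate choice in this sketch: it turns a forgotten variant into a build failure rather than a runtime surprise.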

@polarathene Thanks for your input. I'm actually running mistral.rs directly in my Rust application, not using Docker at the moment. However, I appreciate the tips about zombie processes and containerization - they'll be useful if I containerize in the future.

@solaoi thanks for letting me know, I just merged #719 which should fix this. Can you please try the build again?

@polarathene thanks for your thoughts! I think this was indeed a zombie process issue: we launch a separate thread for the Engine and Pipeline, but when the MistralRs is dropped, the handle to that thread is dropped too, and all the associated memory leaks. That's just my analysis (it may be wrong), and it is what informed #735.
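The general mechanism described here can be shown with the standard library alone: dropping a `JoinHandle` detaches the thread rather than stopping it, so anything the thread owns stays allocated. This is only a sketch of that std behaviour, not of mistral.rs internals; `buffer_outlives_handle` and the channel names are hypothetical.

```rust
use std::sync::mpsc::channel;
use std::sync::{Arc, Weak};
use std::thread;

// Returns true if the buffer is still alive after the JoinHandle was dropped.
fn buffer_outlives_handle() -> bool {
    let buffer = Arc::new(vec![0u8; 1024]); // stand-in for model weights
    let weak: Weak<Vec<u8>> = Arc::downgrade(&buffer);
    let (stop_tx, stop_rx) = channel::<()>();
    let (ready_tx, ready_rx) = channel::<()>();

    let handle = thread::spawn(move || {
        let _owned = buffer; // the thread holds the only strong reference
        ready_tx.send(()).unwrap();
        // Block until the stop sender is dropped (or a message arrives).
        let _ = stop_rx.recv();
    });

    ready_rx.recv().unwrap(); // wait until the thread is running
    drop(handle); // detaches the thread; it does not stop it

    let alive = weak.upgrade().is_some();
    drop(stop_tx); // let the thread exit so the example terminates cleanly
    alive
}

fn main() {
    println!("buffer alive after dropping handle: {}", buffer_outlives_handle());
}
```

The memory is only released once the thread itself returns, which is why an explicit stop signal is needed.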

@EricLBuehler Thank you for merging #719. I've tried the new build with the Metal backend, but I'm still encountering an error. Here's the relevant part of the backtrace with RUST_BACKTRACE=full:

thread 'main' panicked at:
internal error: entered unreachable code
stack backtrace:
   0:        0x10602c980 - std::backtrace_rs::backtrace::libunwind::trace::ha6e1b57d52f71487
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/../../backtrace/src/backtrace/libunwind.rs:116:5
   1:        0x10602c980 - std::backtrace_rs::backtrace::trace_unsynchronized::h9ddaec9267910606
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:        0x10602c980 - std::sys_common::backtrace::_print_fmt::hd5e045ae4df2418e
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/sys_common/backtrace.rs:68:5
   3:        0x10602c980 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h41035ce174e31160
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/sys_common/backtrace.rs:44:22
   4:        0x10604e798 - core::fmt::rt::Argument::fmt::hd945519cb60b34bb
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/fmt/rt.rs:165:63
   5:        0x10604e798 - core::fmt::write::h7e946826fce7616b
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/fmt/mod.rs:1168:21
   6:        0x106029128 - std::io::Write::write_fmt::he3645adfefb23e4a
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/io/mod.rs:1835:15
   7:        0x10602c7d8 - std::sys_common::backtrace::_print::hb62ba094b434c569
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/sys_common/backtrace.rs:47:5
   8:        0x10602c7d8 - std::sys_common::backtrace::print::h2efe9ae66fda73dc
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/sys_common/backtrace.rs:34:9
   9:        0x10602d9b0 - std::panicking::default_hook::{{closure}}::hd27200b4fbd3bf40
  10:        0x10602d67c - std::panicking::default_hook::hb8656334461229c8
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:298:9
  11:        0x10602e25c - std::panicking::rust_panic_with_hook::h10171cf76e1aed15
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:795:13
  12:        0x10602dc6c - std::panicking::begin_panic_handler::{{closure}}::h9344de43a47cae21
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:656:13
  13:        0x10602ce04 - std::sys_common::backtrace::__rust_end_short_backtrace::h55013ada3ab9c4e8
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/sys_common/backtrace.rs:171:18
  14:        0x10602da08 - rust_begin_unwind
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:652:5
  15:        0x10609cdb8 - core::panicking::panic_fmt::h0b16bb09366e1f01
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:72:14
  16:        0x10609ce38 - core::panicking::panic::h61ea408fdd25f03d
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:146:5
  17:        0x104832488 - test_mistral::main::hc4a2a0d542ec857f
                               at /Users/solaoi/Projects/solaoi/test_mistral/src/main.rs:80:14
  18:        0x104830a98 - core::ops::function::FnOnce::call_once::h55fdbabf3cb647d9
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
  19:        0x10483005c - std::sys_common::backtrace::__rust_begin_short_backtrace::h921cd6afed4a8400
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/sys_common/backtrace.rs:155:18
  20:        0x104835470 - std::rt::lang_start::{{closure}}::h6c398f7f270e2c70
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/rt.rs:159:18
  21:        0x106022930 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h081836dd0d716055
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:284:13
  22:        0x106022930 - std::panicking::try::do_call::h9d30dba5b3de8818
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:559:40
  23:        0x106022930 - std::panicking::try::h4de9fe721f1abb3d
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:523:19
  24:        0x106022930 - std::panic::catch_unwind::h597e8d4c5d489d43
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panic.rs:149:14
  25:        0x106022930 - std::rt::lang_start_internal::{{closure}}::h4cafcc96baeac274
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/rt.rs:141:48
  26:        0x106022930 - std::panicking::try::do_call::hd8cf3d55a1ab816e
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:559:40
  27:        0x106022930 - std::panicking::try::h1f2500bfe6fa656a
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:523:19
  28:        0x106022930 - std::panic::catch_unwind::h9cdfd4674b4e1cfa
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panic.rs:149:14
  29:        0x106022930 - std::rt::lang_start_internal::h27a134f18d582a1e
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/rt.rs:141:20
  30:        0x10483543c - std::rt::lang_start::h370176eb63af78c2
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/rt.rs:158:17
  31:        0x104832714 - _main

Importantly, this error only occurs when using the Metal backend. I've tested the same code with the CPU backend, and it works without any issues. This suggests that the problem might be specific to the Metal implementation in the library.

Could you please investigate this? Let me know if you need any additional information or if you'd like me to try anything specific to help diagnose the issue.

@solaoi thanks for letting me know. Can you please attach the full backtrace?

@EricLBuehler Of course. This is the full backtrace you requested:

RUST_BACKTRACE=full cargo run 
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.61s
     Running `target/debug/hello_local_llm`
general.architecture: llama
general.file_type: 2
general.name: LLaMA v2
general.quantization_version: 2
llama.attention.head_count: 40
llama.attention.head_count_kv: 40
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 40
llama.context_length: 4096
llama.embedding_length: 5120
llama.feed_forward_length: 13824
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000
llama.vocab_size: 43176
thread 'main' panicked at src/main.rs:80:14:
internal error: entered unreachable code
stack backtrace:
   0:        0x105b48980 - std::backtrace_rs::backtrace::libunwind::trace::ha6e1b57d52f71487
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/../../backtrace/src/backtrace/libunwind.rs:116:5
   1:        0x105b48980 - std::backtrace_rs::backtrace::trace_unsynchronized::h9ddaec9267910606
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:        0x105b48980 - std::sys_common::backtrace::_print_fmt::hd5e045ae4df2418e
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/sys_common/backtrace.rs:68:5
   3:        0x105b48980 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h41035ce174e31160
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/sys_common/backtrace.rs:44:22
   4:        0x105b6a798 - core::fmt::rt::Argument::fmt::hd945519cb60b34bb
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/fmt/rt.rs:165:63
   5:        0x105b6a798 - core::fmt::write::h7e946826fce7616b
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/fmt/mod.rs:1168:21
   6:        0x105b45128 - std::io::Write::write_fmt::he3645adfefb23e4a
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/io/mod.rs:1835:15
   7:        0x105b487d8 - std::sys_common::backtrace::_print::hb62ba094b434c569
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/sys_common/backtrace.rs:47:5
   8:        0x105b487d8 - std::sys_common::backtrace::print::h2efe9ae66fda73dc
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/sys_common/backtrace.rs:34:9
   9:        0x105b499b0 - std::panicking::default_hook::{{closure}}::hd27200b4fbd3bf40
  10:        0x105b4967c - std::panicking::default_hook::hb8656334461229c8
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:298:9
  11:        0x105b4a25c - std::panicking::rust_panic_with_hook::h10171cf76e1aed15
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:795:13
  12:        0x105b49c6c - std::panicking::begin_panic_handler::{{closure}}::h9344de43a47cae21
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:656:13
  13:        0x105b48e04 - std::sys_common::backtrace::__rust_end_short_backtrace::h55013ada3ab9c4e8
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/sys_common/backtrace.rs:171:18
  14:        0x105b49a08 - rust_begin_unwind
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:652:5
  15:        0x105bb8db8 - core::panicking::panic_fmt::h0b16bb09366e1f01
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:72:14
  16:        0x105bb8e38 - core::panicking::panic::h61ea408fdd25f03d
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:146:5
  17:        0x10434e488 - test_mistral::main::hc4a2a0d542ec857f
                               at /Users/solaoi/Projects/solaoi/test_mistral/src/main.rs:80:14
  18:        0x10434ca98 - core::ops::function::FnOnce::call_once::h55fdbabf3cb647d9
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
  19:        0x10434c05c - std::sys_common::backtrace::__rust_begin_short_backtrace::h921cd6afed4a8400
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/sys_common/backtrace.rs:155:18
  20:        0x104351470 - std::rt::lang_start::{{closure}}::h6c398f7f270e2c70
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/rt.rs:159:18
  21:        0x105b3e930 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h081836dd0d716055
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:284:13
  22:        0x105b3e930 - std::panicking::try::do_call::h9d30dba5b3de8818
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:559:40
  23:        0x105b3e930 - std::panicking::try::h4de9fe721f1abb3d
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:523:19
  24:        0x105b3e930 - std::panic::catch_unwind::h597e8d4c5d489d43
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panic.rs:149:14
  25:        0x105b3e930 - std::rt::lang_start_internal::{{closure}}::h4cafcc96baeac274
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/rt.rs:141:48
  26:        0x105b3e930 - std::panicking::try::do_call::hd8cf3d55a1ab816e
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:559:40
  27:        0x105b3e930 - std::panicking::try::h1f2500bfe6fa656a
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:523:19
  28:        0x105b3e930 - std::panic::catch_unwind::h9cdfd4674b4e1cfa
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panic.rs:149:14
  29:        0x105b3e930 - std::rt::lang_start_internal::h27a134f18d582a1e
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/rt.rs:141:20
  30:        0x10435143c - std::rt::lang_start::h370176eb63af78c2
                               at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/rt.rs:158:17
  31:        0x10434e714 - _main

@solaoi I see. Can you please share the code you are running? It seems the panic occurs there, not in mistral.rs internal code.
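The backtrace supports this reading: the frame directly below `core::panicking::panic` is `test_mistral::main` at `src/main.rs:80:14`, and the message "entered unreachable code" is what `unreachable!()` prints. A catch-all `_ => unreachable!()` arm hides whichever variant actually arrived. The sketch below shows one way to surface it instead, using a hypothetical `Reply` enum whose variant names are illustrative and not the mistral.rs `Response` type.

```rust
// Hypothetical response type; the variant names are illustrative only.
#[derive(Debug)]
enum Reply {
    Done(String),
    ModelError(String),
    InternalError(String),
}

// Turn every non-success variant into an `Err` instead of `unreachable!()`.
fn into_text(reply: Reply) -> Result<String, String> {
    match reply {
        Reply::Done(text) => Ok(text),
        Reply::ModelError(msg) => Err(format!("model error: {msg}")),
        Reply::InternalError(msg) => Err(format!("internal error: {msg}")),
    }
}

fn main() {
    let replies = [
        Reply::Done("hello".to_string()),
        Reply::ModelError("kernel failed to compile".to_string()),
        Reply::InternalError("channel closed".to_string()),
    ];
    for reply in replies {
        match into_text(reply) {
            Ok(text) => println!("Text: {text}"),
            Err(e) => eprintln!("request failed: {e}"),
        }
    }
}
```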

@EricLBuehler Here's the code I'm running:

use mistralrs::{
    Constraint, DefaultSchedulerMethod, Device, DeviceMapMetadata, GGUFLoaderBuilder, GGUFSpecificConfig, MistralRs, MistralRsBuilder, ModelDType, NormalRequest, Request, RequestMessage, Response, SamplingParams, SchedulerConfig, TokenSource
};
use std::sync::Arc;
use tokio::sync::mpsc::channel;

fn setup() -> anyhow::Result<Arc<MistralRs>> {
    let loader = GGUFLoaderBuilder::new(
        Some("chat_templates_llama2.json".to_string()),
        None,
        ".".to_string(),
        vec!["aixsatoshi-Honyaku-13b-Q4_0.gguf".to_string()],
        GGUFSpecificConfig {
            prompt_batchsize: None,
            topology: None,
        },
    )
    .build();

    let pipeline = tokio::task::block_in_place(|| {
        loader.load_model_from_hf(
            None,
            TokenSource::None,
            &ModelDType::Auto,
            &Device::new_metal(0).unwrap(),
            false,
            DeviceMapMetadata::dummy(),
            None,
            None,
        )
    })?;

    Ok(MistralRsBuilder::new(
        pipeline,
        SchedulerConfig::DefaultScheduler {
            method: DefaultSchedulerMethod::Fixed(5.try_into().unwrap()),
        },
    )
    .build())
}

fn main() -> anyhow::Result<()> {
    let mistralrs = setup()?;
    let text = std::env::args()
        .nth(1)
        .unwrap_or_else(|| "Hello world!".to_string());
    let prompt = format!("<english>: {} <NL>\n\n<japanese>: ", text);

    let (tx, mut rx) = channel(10_000);
    let request = Request::Normal(NormalRequest {
        messages: RequestMessage::Completion {
            text: prompt,
            echo_prompt: false,
            best_of: 1,
        },
        sampling_params: SamplingParams::default(),
        response: tx,
        return_logprobs: false,
        is_streaming: false,
        id: 0,
        constraint: Constraint::None,
        suffix: None,
        adapters: None,
        tools: None,
        tool_choice: None,
        logits_processors: None,
    });
    mistralrs.get_sender()?.blocking_send(request)?;

    let response = rx.blocking_recv().unwrap();
    match response {
        Response::CompletionDone(c) => println!("Text: {}", c.choices[0].text),
        _ => unreachable!(),
    }

    Ok(())
}

@solaoi I just merged #739 which makes handling errors a bit easier. Can you please modify your program to be:

use mistralrs::{
    Constraint, DefaultSchedulerMethod, Device, DeviceMapMetadata, GGUFLoaderBuilder, GGUFSpecificConfig, MistralRs, MistralRsBuilder, ModelDType, NormalRequest, Request, RequestMessage, ResponseOk, SamplingParams, SchedulerConfig, TokenSource
};
use std::sync::Arc;
use tokio::sync::mpsc::channel;

fn setup() -> anyhow::Result<Arc<MistralRs>> {
    let loader = GGUFLoaderBuilder::new(
        Some("chat_templates_llama2.json".to_string()),
        None,
        ".".to_string(),
        vec!["aixsatoshi-Honyaku-13b-Q4_0.gguf".to_string()],
        GGUFSpecificConfig {
            prompt_batchsize: None,
            topology: None,
        },
    )
    .build();

    let pipeline = tokio::task::block_in_place(|| {
        loader.load_model_from_hf(
            None,
            TokenSource::None,
            &ModelDType::Auto,
            &Device::new_metal(0).unwrap(),
            false,
            DeviceMapMetadata::dummy(),
            None,
            None,
        )
    })?;

    Ok(MistralRsBuilder::new(
        pipeline,
        SchedulerConfig::DefaultScheduler {
            method: DefaultSchedulerMethod::Fixed(5.try_into().unwrap()),
        },
    )
    .build())
}

fn main() -> anyhow::Result<()> {
    let mistralrs = setup()?;
    let text = std::env::args()
        .nth(1)
        .unwrap_or_else(|| "Hello world!".to_string());
    let prompt = format!("<english>: {} <NL>\n\n<japanese>: ", text);

    let (tx, mut rx) = channel(10_000);
    let request = Request::Normal(NormalRequest {
        messages: RequestMessage::Completion {
            text: prompt,
            echo_prompt: false,
            best_of: 1,
        },
        sampling_params: SamplingParams::default(),
        response: tx,
        return_logprobs: false,
        is_streaming: false,
        id: 0,
        constraint: Constraint::None,
        suffix: None,
        adapters: None,
        tools: None,
        tool_choice: None,
        logits_processors: None,
    });
    mistralrs.get_sender()?.blocking_send(request)?;

    let response = rx.blocking_recv().unwrap().as_result().unwrap();
    match response {
        ResponseOk::CompletionDone(c) => println!("Text: {}", c.choices[0].text),
        _ => unreachable!(),
    }

    Ok(())
}

@EricLBuehler Thank you for the quick update and for merging #739. I appreciate your efforts to improve error handling.

I've modified my program as you suggested and tried running it. Unfortunately, I'm still encountering an error. Here's what I'm seeing now:

cargo run
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.47s
     Running `target/debug/test_mistral`
general.architecture: llama
general.file_type: 2
general.name: LLaMA v2
general.quantization_version: 2
llama.attention.head_count: 40
llama.attention.head_count_kv: 40
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 40
llama.context_length: 4096
llama.embedding_length: 5120
llama.feed_forward_length: 13824
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000
llama.vocab_size: 43176
thread 'main' panicked at src/main.rs:70:60:
called `Result::unwrap()` on an `Err` value: CompletionModelError { msg: "Metal error Error while loading library: program_source:1530:18: error: unknown type name 'bfloat'; did you mean 'float'?\n    device const bfloat* x = (device const bfloat*) (src0 + offset0);\n                 ^~~~~~\n                 float\nprogram_source:1530:44: error: unknown type name 'bfloat'; did you mean 'float'?\n    device const bfloat* x = (device const bfloat*) (src0 + offset0);\n                                           ^~~~~~\n                                           float\nprogram_source:1543:22: error: unknown type name 'bfloat4'; did you mean 'float4'?\n        device const bfloat4* x4 = (device const bfloat4*) x;\n                     ^~~~~~~\n                     float4\n/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/31001/Libraries/lib/clang/31001.723/include/metal/metal_extended_vector:134:55: note: 'float4' declared here\ntypedef __attribute__((__ext_vector_type__(4))) float float4;\n                                                      ^\nprogram_source:1543:50: error: unknown type name 'bfloat4'; did you mean 'float4'?\n        device const bfloat4* x4 = (device const bfloat4*) x;\n                                                 ^~~~~~~\n                                                 float4\n/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/31001/Libraries/lib/clang/31001.723/include/metal/metal_extended_vector:134:55: note: 'float4' declared here\ntypedef __attribute__((__ext_vector_type__(4))) float float4;\n                                                      ^\nprogram_source:1616:18: error: unknown type name 'bfloat'; did you mean 'float'?\n    device const bfloat * x = (device const bfloat *) (src0 + offset0);\n                 ^~~~~~\n                 float\nprogram_source:1616:45: error: unknown type name 'bfloat'; did you mean 'float'?\n    device const bfloat * x = (device const bfloat *) (src0 + offset0);\n            
                                ^~~~~~\n                                            float\nprogram_source:1638:22: error: unknown type name 'bfloat4'; did you mean 'float4'?\n        device const bfloat4 * x4 = (device const bfloat4 *)x;\n                     ^~~~~~~\n                     float4\n/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/31001/Libraries/lib/clang/31001.723/include/metal/metal_extended_vector:134:55: note: 'float4' declared here\ntypedef __attribute__((__ext_vector_type__(4))) float float4;\n                                                      ^\nprogram_source:1638:51: error: unknown type name 'bfloat4'; did you mean 'float4'?\n        device const bfloat4 * x4 = (device const bfloat4 *)x;\n                                                  ^~~~~~~\n                                                  float4\n/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/31001/Libraries/lib/clang/31001.723/include/metal/metal_extended_vector:134:55: note: 'float4' declared here\ntypedef __attribute__((__ext_vector_type__(4))) float float4;\n                                                      ^\nprogram_source:1721:18: error: unknown type name 'bfloat4'; did you mean 'float4'?\n    device const bfloat4 * x4 = (device const bfloat4 *) (src0 + offset0);\n                 ^~~~~~~\n                 float4\n/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/31001/Libraries/lib/clang/31001.723/include/metal/metal_extended_vector:134:55: note: 'float4' declared here\ntypedef __attribute__((__ext_vector_type__(4))) float float4;\n                                                      ^\nprogram_source:1721:47: error: unknown type name 'bfloat4'; did you mean 'float4'?\n    device const bfloat4 * x4 = (device const bfloat4 *) (src0 + offset0);\n                                              ^~~~~~~\n                                              
float4\n/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/31001/Libraries/lib/clang/31001.723/include/metal/metal_extended_vector:134:55: note: 'float4' declared here\ntypedef __attribute__((__ext_vector_type__(4))) float float4;\n                                                      ^\n", incomplete_response: CompletionResponse { id: "0", choices: [CompletionChoice { finish_reason: "error", index: 0, text: "", logprobs: None }], created: 1725286799, model: ".", system_fingerprint: "local", object: "text_completion", usage: Usage { completion_tokens: 0, prompt_tokens: 24, total_tokens: 24, avg_tok_per_sec: 220.18349, avg_prompt_tok_per_sec: inf, avg_compl_tok_per_sec: NaN, total_time_sec: 0.109, total_prompt_time_sec: 0.0, total_completion_time_sec: 0.0 } } }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@solaoi thanks for the quick response! I just opened #740 (branch metal_build_quant_2), which tries to fix this. Can you please check if it works on that branch?

@EricLBuehler Thank you so much for your quick action! I really appreciate your dedication to resolving this issue.

I've just tested the metal_build_quant_2 branch, and I'm thrilled to report that it works perfectly! The program now runs without any errors on the Metal backend.

You're amazing! Thank you for your hard work and prompt responses throughout this process. It's great to see how quickly you've been able to address and resolve these issues.

@solaoi, glad to help and that it works! I just merged #740 into master.

Please feel free to open another issue if you have any questions or problems!

EricLBuehler / mistral.rs

Memory leak when using local GGUF model with mistral.rs in Rust #723
