EricLBuehler / mistral.rs

Blazingly fast LLM inference.

Unable to load quantized model: Insufficient memory while VRAM suffices #781

Closed: rickbeeloo closed this issue 1 month ago

rickbeeloo commented 1 month ago

Describe the bug

Not sure if this is a bug, but if it isn't, the docs could perhaps benefit from another example. According to the Rust example, we can load a quantized model like:

let loader = GGUFLoaderBuilder::new(
    None,                                                      // chat template (None = default)
    Some("mistralai/Mistral-7B-Instruct-v0.1".to_string()),    // model ID to source the tokenizer from
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF".to_string(),      // Hugging Face repo containing the quantized weights
    vec!["mistral-7b-instruct-v0.1.Q4_K_M.gguf".to_string()],  // GGUF file(s) to load
    GGUFSpecificConfig {
        prompt_batchsize: None,
        topology: None,
    },
)
.build();

I have two NVIDIA RTX A6000 GPUs, each with 48GB of VRAM. Reading here, and just checking the size of the Q4 Llama 3.1 70B quant, it should fit:

Q4_K_L   43.30GB   split: false   (Meta-Llama-3.1-70B-Instruct-Q4_K_L.gguf)

Doing this for the Llama 3.1 model, following the example:

let loader = GGUFLoaderBuilder::new(
    None,
    Some("meta-llama/Meta-Llama-3.1-70B".to_string()),
    "bartowski/Meta-Llama-3.1-70B-Instruct-GGUF".to_string(),
    vec!["Meta-Llama-3.1-70B-Instruct-Q4_K_L.gguf".to_string()],
    GGUFSpecificConfig {
        prompt_batchsize: None,
        topology: None,
    },
).build();

At around 60 of the 80 repeated layers, loading fills my GPU RAM and crashes. I was able to load Meta-Llama-3.1-70B-Instruct-Q3_K_M.gguf (34.27GB), which takes up 46.9GB, but when prompting I again ran out of memory. It seems that I have to go down to a very low quantization, at about half of my VRAM, for mistral-rs to work, while ollama could run these models.
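
For a rough sense of where the memory goes when prompting, here is a back-of-the-envelope estimate on my side (assuming Llama 3.1 70B's published config of 80 layers, 8 KV heads, head dimension 128, and an f16 KV cache; this is just arithmetic, not a mistral.rs API):

// KV-cache estimate for Llama 3.1 70B (assumed: 80 layers, 8 KV heads,
// head dim 128, f16 cache). Plain arithmetic, no mistral.rs involved.
fn main() {
    let layers = 80u64;
    let kv_heads = 8u64;
    let head_dim = 128u64;
    let bytes_per_elem = 2u64; // f16

    // K and V tensors per token, summed over all layers.
    let bytes_per_token = 2 * kv_heads * head_dim * bytes_per_elem * layers;
    println!("KV cache per token: {} KB", bytes_per_token / 1024); // ~320 KB

    for ctx in [8_192u64, 32_768, 131_072] {
        let gib = (ctx * bytes_per_token) as f64 / (1024.0 * 1024.0 * 1024.0);
        println!("context {:>7}: ~{:.1} GiB of KV cache", ctx, gib);
    }
    // 8k -> ~2.5 GiB, 32k -> ~10 GiB, 128k -> ~40 GiB of cache,
    // on top of ~43 GB of Q4 weights, so one 48 GB card fills up quickly
    // unless the layers are split across both GPUs.
}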

I would love to use mistral-rs instead; am I missing something?

Latest commit or version

2b67cc42e0ff82757b2434029d94c51826329e67

rickbeeloo commented 1 month ago

Should I maybe use DeviceMapMetadata? I see an example in the Python code, but the Rust examples all use DeviceMapMetadata::dummy().
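
To make it concrete, something like this instead of DeviceMapMetadata::dummy() is what I have in mind (a sketch on my side, assuming a DeviceLayerMapMetadata-style API with explicit per-GPU layer counts; the exact names, signatures, and layer split may differ between versions):

use mistralrs::{DeviceLayerMapMetadata, DeviceMapMetadata};

// Split the 80 repeated layers of the 70B model across both A6000s
// (40 + 40 is just an illustrative split, not a tuned value).
let mapper = DeviceMapMetadata::from_num_device_layers(vec![
    DeviceLayerMapMetadata { ordinal: 0, layers: 40 }, // first GPU
    DeviceLayerMapMetadata { ordinal: 1, layers: 40 }, // second GPU
]);

// ...then pass `mapper` to load_model_from_hf in place of DeviceMapMetadata::dummy().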

ShelbyJenkins commented 1 month ago

Ack, I didn't read your message closely enough. I believe you're only using one card. The llm_client crate has an implementation you can copy that shows how to add CUDA devices to the device mapper: https://github.com/ShelbyJenkins/llm_client/blob/master/src/llm_backends/mistral_rs/devices.rs

rickbeeloo commented 1 month ago

Hey @ShelbyJenkins, I was mainly curious whether mistral.rs indeed requires an extra 10GB or so on top of the tensors, to get an idea of the RAM usage.

But I actually have multiple GPUs, so the device example comes in handy :)

rickbeeloo commented 1 month ago

Probably related: https://github.com/EricLBuehler/mistral.rs/issues/44

rickbeeloo commented 1 month ago

This can be closed; it works with this: https://github.com/EricLBuehler/mistral.rs/issues/44#issuecomment-2366931339