Closed: rickbeeloo closed this issue 1 month ago
Should I use DeviceMapMetadata, maybe? I see an example in the Python code, but the Rust examples all use DeviceMapMetadata::dummy().
Ack, I didn't read your message closely enough. I believe you're only using one card. The llm_client crate has an implementation you can copy that shows how to add CUDA devices to the device mapper: https://github.com/ShelbyJenkins/llm_client/blob/master/src/llm_backends/mistral_rs/devices.rs
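Roughly, the idea in that file is to tell the device mapper how many of the repeating layers to put on each CUDA device. A minimal sketch, assuming the `DeviceLayerMapMetadata` / `from_num_device_layers` API that mistral.rs exposes around this commit (names and fields may differ in other versions):

```rust
use mistralrs::{DeviceLayerMapMetadata, DeviceMapMetadata};

// Hypothetical split of an 80-layer model across two GPUs; tune the
// per-device layer counts to whatever actually fits in each card's vRAM.
fn two_gpu_device_map() -> DeviceMapMetadata {
    DeviceMapMetadata::from_num_device_layers(vec![
        DeviceLayerMapMetadata { ordinal: 0, layers: 40 }, // CUDA device 0
        DeviceLayerMapMetadata { ordinal: 1, layers: 40 }, // CUDA device 1
    ])
}
```

Passing something like this instead of `DeviceMapMetadata::dummy()` when loading the model should spread the layers across both cards.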
Hey @ShelbyJenkins, I was mainly curious whether mistral.rs indeed requires an extra ~10 GB or so on top of the tensors, to get an idea of the RAM usage.
But I actually have multiple GPUs so the device example comes in handy :)
Probably related: https://github.com/EricLBuehler/mistral.rs/issues/44
This can be closed; it works with the approach from https://github.com/EricLBuehler/mistral.rs/issues/44#issuecomment-2366931339
Describe the bug
Not sure if this is a bug, but if it isn't, it could perhaps benefit from another example. According to the Rust example, we can load a quantized model like this:
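(For context, a minimal sketch of such a load using the crate's builder-style Rust API; the repo id and filename below are placeholders, and the lower-level examples do the equivalent via `GGUFLoaderBuilder` plus `DeviceMapMetadata::dummy()`.)

```rust
use anyhow::Result;
use mistralrs::{GgufModelBuilder, TextMessageRole, TextMessages};

#[tokio::main]
async fn main() -> Result<()> {
    // Placeholder Hugging Face repo and GGUF filename; swap in the model you want.
    let model = GgufModelBuilder::new(
        "some-org/Some-Model-GGUF",
        vec!["some-model-Q4_K_M.gguf"],
    )
    .with_logging()
    .build()
    .await?;

    let messages = TextMessages::new()
        .add_message(TextMessageRole::User, "Hello, how are you?");

    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}
```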
I have two NVIDIA RTX A6000 GPUs, so 48 GB of vRAM each. Reading here, and just checking the size of the Q4 Llama 3.1 70B GGUF (Meta-Llama-3.1-70B-Instruct-Q4_K_L.gguf), it should fit.
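(Two A6000s give 96 GB of vRAM in total, and that Q4_K_L file is roughly 43 GB, so on paper the weights alone leave plenty of headroom.)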
Doing this for the Llama 3.1 model, following the example:
At around 60/80 repeated layers this fills up my GPU RAM and crashes. I was able to load
Meta-Llama-3.1-70B-Instruct-Q3_K_M.gguf
(34.27 GB), which takes up 46.9 GB of vRAM, but when prompting I again ran out of memory. It seems that I have to go down to a very low quantization, using only about half of the vRAM, for mistral-rs to work, while ollama could run these models. I would love to use mistral-rs instead; am I missing something?
Latest commit or version
2b67cc42e0ff82757b2434029d94c51826329e67