Hi, we have recently integrated our models into llama-cpp-python directly. Here's how you can use it. Can you try it and see if it works now?
I tested it on my end with the following code, and the model loads using 4.835 GB of GPU VRAM.
from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer

llm = Llama.from_pretrained(
    repo_id="meetkai/functionary-7b-v2-GGUF",
    filename="functionary-7b-v2.q4_0.gguf",
    chat_format="functionary-v2",
    tokenizer=LlamaHFTokenizer.from_pretrained("meetkai/functionary-7b-v2-GGUF"),
    n_ctx=4096,
    n_gpu_layers=-1,
)
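Once loaded, the model can be driven through the regular chat-completion API with tools. A minimal sketch (the get_current_weather tool and the user message below are placeholders for illustration, not something from this thread):

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the weather like in Istanbul?"},
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",  # placeholder tool for illustration
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"},
                    },
                    "required": ["city"],
                },
            },
        }
    ],
    tool_choice="auto",  # let the model decide whether to call the tool
)
print(response["choices"][0]["message"])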
Yes it works. Quick question: is there a way to load a local GGUF file instead of downloading it from the hub?
Sorry for the late reply, but yes: you can load a local GGUF file by initializing the Llama class directly. Here's a guide showing how, and a minimal example is sketched below.
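A minimal sketch of loading the same local file (the model_path here is just wherever the GGUF was downloaded to; the tokenizer is still fetched from the Hub, as in the snippet above):

from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer

# Sketch: construct Llama directly from a local GGUF file instead of
# downloading it via Llama.from_pretrained().
llm = Llama(
    model_path="./functionary-7b-v2.q4_0.gguf",  # local path, adjust as needed
    chat_format="functionary-v2",
    tokenizer=LlamaHFTokenizer.from_pretrained("meetkai/functionary-7b-v2-GGUF"),
    n_ctx=4096,
    n_gpu_layers=-1,
)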
This is how I am loading the model in Python, but it only uses the CPU:
llm = Llama(
    model_path="./functionary-7b-v2.q4_0.gguf",
    n_ctx=4096,
    n_gpu_layers=50,
)
I have also tried to reinstall llama-cpp-python with the commands below, but that didn't help:
set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
set FORCE_CMAKE=1
pip install --upgrade --verbose --force-reinstall llama-cpp-python --no-cache-dir
My GPU has only 8 GB of VRAM; could that be the reason? I saw in the readme that this model requires 24 GB of VRAM... However, other models such as Mistral load on my GPU just fine, so I am assuming that my CUDA installation is correct.
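A quick way to check whether the installed build supports GPU offload at all, before blaming the model size (a sketch, assuming a recent llama-cpp-python that exposes the low-level llama_supports_gpu_offload binding):

import llama_cpp

# If this prints False, the wheel was built without CUDA support and
# n_gpu_layers is effectively ignored, so inference falls back to the CPU.
print(llama_cpp.__version__)
print(llama_cpp.llama_supports_gpu_offload())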