Closed — undefdev closed this issue 1 year ago
Hello,
I inherit model_architecture from the llm crate; the correct architecture is "llama". If you have issues, here is the exact command I run on my M1:
./target/release/cria llama ~/Downloads/llama-7b.ggmlv3.q4_0.bin --use-gpu --gpu-layers 32
What weights are you using?
Hi,
I'm using a q2_K quantized llama-70b finetune. Does the llm crate use the latest llama.cpp?
Yes, it does. What GPU are you using?
I'm using an M1 Max with 64 GB of RAM. It works fine with llama.cpp, although for Llama 2 models of this size -gqa 8 (grouped-query attention) needs to be set. Could this be the problem?
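For context, the -gqa 8 flag matters because grouped-query attention lets several query heads share one key/value head, so the loader needs the grouping factor to compute the right tensor shapes. A minimal sketch of the arithmetic (illustrative only, not the llm crate API):

```rust
// With grouped-query attention, n_heads query heads are split into groups
// that each share one key/value head, so the KV cache shrinks by the
// GQA factor.
fn kv_heads(n_heads: u32, n_gqa: u32) -> u32 {
    assert!(n_heads % n_gqa == 0, "query heads must split evenly into groups");
    n_heads / n_gqa
}

fn main() {
    // Llama-2-70B uses 64 attention heads with a GQA factor of 8,
    // so only 8 key/value heads are stored.
    println!("70B kv heads: {}", kv_heads(64, 8));
    // Llama-2-7B uses plain multi-head attention: GQA factor 1.
    println!("7B kv heads: {}", kv_heads(32, 1));
}
```

If the loader assumes a factor of 1 for a 70B model, the key/value tensor shapes in the file won't match what it expects, which is consistent with the load failure seen here.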
same here
just got it working like this:
./cria Llama lama.bin --use-gpu --gpu-layers 32
mind the capital L
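The capital-L workaround suggests the architecture argument is matched verbatim. A case-insensitive match would accept either spelling; a small sketch (purely illustrative, not cria's actual argument parser):

```rust
// Normalize the architecture argument before matching, so "Llama",
// "llama", and "LLAMA" all resolve to the same architecture, while
// genuinely unknown names are still rejected.
fn parse_architecture(arg: &str) -> Option<&'static str> {
    match arg.to_ascii_lowercase().as_str() {
        "llama" => Some("llama"),
        _ => None,
    }
}

fn main() {
    assert_eq!(parse_architecture("Llama"), Some("llama"));
    assert_eq!(parse_architecture("llama"), Some("llama"));
    assert_eq!(parse_architecture("lama"), None); // typos still rejected
    println!("ok");
}
```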
@undefdev: Finally got my hands on a machine with an A100 where I could test loading the 70B model. The issue comes from the grouped-query attention params that are not passed to the llm crate. I am working on a fix right now. Should be available very soon!
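The shape of the fix described above is that the GQA factor has to be forwarded from the CLI into the model-loading parameters instead of being dropped. A hypothetical sketch; all field and function names here are invented for illustration and are not the real llm crate API:

```rust
// Illustrative only: thread the grouped-query-attention factor from the
// CLI through to the loader's parameters, rather than silently omitting it.
#[derive(Debug, Default)]
struct ModelParameters {
    use_gpu: bool,
    gpu_layers: usize,
    n_gqa: Option<usize>, // None for 7B/13B, Some(8) for Llama-2-70B
}

fn params_from_cli(use_gpu: bool, gpu_layers: usize, n_gqa: Option<usize>) -> ModelParameters {
    ModelParameters { use_gpu, gpu_layers, n_gqa }
}

fn main() {
    // Mirrors the 70B invocation later in this thread: -u -g 83 --n-gqa 8
    let params = params_from_cli(true, 83, Some(8));
    println!("{:?}", params);
}
```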
Hello there!
Great news, cria now supports the Llama-2-70B model! The PR has been accepted and merged in the llm crate. Also, there is no need to use my patched version of llm anymore 😄!
Here are the steps to load the 70B model :
git clone git@github.com:AmineDiro/cria.git
cd cria/
cargo b --release --features cublas
./target/release/cria -a llama --model {MODEL_BIN_PATH} -u -g 83 --n-gqa 8
I'm getting this error when trying to run on macOS:
If I use LLama instead, it crashes (as it probably should).