AmineDiro / cria

OpenAI compatible API for serving LLAMA-2 model
MIT License

llama-2 70B support #2

Closed undefdev closed 1 year ago

undefdev commented 1 year ago

I'm getting this error when trying to run on MacOS:

error: invalid value 'llama-2' for '<MODEL_ARCHITECTURE>': llama-2 is not one of supported model architectures: [Bloom, Gpt2, GptJ, GptNeoX, Llama, Mpt]

If I use Llama instead, it crashes (as it probably should):

GGML_ASSERT: llama-cpp/ggml.c:6192: ggml_nelements(a) == ne0*ne1*ne2
fish: Job 1, 'target/release/cria Llama ../ll…' terminated by signal SIGABRT (Abort)
AmineDiro commented 1 year ago

Hello,

I inherit model_architecture from the llm crate; the correct architecture is "llama". If you still have issues, here is the exact command I run on my M1:

./target/release/cria llama ~/Downloads/llama-7b.ggmlv3.q4_0.bin --use-gpu --gpu-layers 32

What weights are you using ?

undefdev commented 1 year ago

Hi,

I'm using a q2_K quantized llama-70b finetune. Does the llm crate use the latest llama.cpp?

AmineDiro commented 1 year ago

Yes, it does. What GPU are you using?

undefdev commented 1 year ago

I'm using an M1 Max with 64 GB RAM. It works fine with llama.cpp, although for Llama-2 models of this size -gqa 8 (grouped-query attention) needs to be set. Could this be the problem?
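
For reference, a llama.cpp run with that flag looks roughly like this (the main binary and the model path are just illustrative, not my exact setup):

./main -m ./llama-2-70b.ggmlv3.q2_K.bin -gqa 8 -p "Hello"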

kitalia commented 1 year ago

same here

kitalia commented 1 year ago

just got it working like this:

./cria Llama lama.bin --use-gpu --gpu-layers 32

mind the capital L

AmineDiro commented 1 year ago

@undefdev : Finally got my hands on a machine with an A100 where I could test loading the 70B model. The issue comes from the grouped-query attention parameter, which is not passed to the llm crate. I am working on a fix right now. Should be available very soon!

AmineDiro commented 1 year ago

Hello there! Great news: cria now supports the Llama-2-70B model! The PR has been accepted and merged into the llm crate, so there is no longer any need to use my patched version of llm 😄!

Here are the steps to load the 70B model:

git clone git@github.com:AmineDiro/cria.git
cd cria/
cargo b --release --features cublas
./target/release/cria -a llama --model {MODEL_BIN_PATH} -u -g 83 --n-gqa 8
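
Once the server is up, a quick sanity check against the OpenAI-compatible completions endpoint looks roughly like this (the port 3000 and the /v1/completions route are assumptions here; adjust them to whatever your cria instance actually exposes):

curl http://localhost:3000/v1/completions -H "Content-Type: application/json" -d '{"prompt": "Hello, llama!", "max_tokens": 32}'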