To get Llama 2 working on a Mac you need a quantised model, e.g. ggml-model-q4_0.bin.
Two options are to download one directly from Hugging Face (TheBloke's repos; note you need the q4_0 variant) or to get the weights from Meta and quantise them yourself using llama.cpp.
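For the first option, here is a minimal sketch of downloading a file from the Hub with huggingface_hub; the repo id follows TheBloke's naming at the time, and the exact q4_0 filename is an assumption, so check the model card.

```python
from huggingface_hub import hf_hub_download

# Repo id and filename follow TheBloke's naming convention; check the
# model card for the exact q4_0 filename (both are assumptions here).
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGML",
    filename="llama-2-7b-chat.ggmlv3.q4_0.bin",
)
print(model_path)  # local path to the downloaded file
```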
To quantise them yourself:

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout b5ffb28  # older commit (the one I used) where conversion still produces ggml .bin files
LLAMA_METAL=1 make    # build with Metal support
python convert-pth-to-ggml.py llama-2-7b-chat/ 1
./quantize llama-2-7b-chat/ggml-model-f16.bin llama-2-7b-chat/ggml-model-q4_0.bin q4_0
```
Then you should have a ggml-model-q4_0.bin file which you can use on your Mac.
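To sanity-check the quantised file, you can run it directly with llama.cpp's example binary; the flags below are the usual ones from that era of llama.cpp (run ./main --help at your checkout to confirm), with -ngl offloading layers to Metal.

```bash
# Quick test of the quantised model (flags assumed; see ./main --help).
./main -m llama-2-7b-chat/ggml-model-q4_0.bin -p "Hello, my name is" -n 64 -ngl 1
```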
There is also llama-cpp-python, a Python binding for llama.cpp, so you can do all of this from a Python script.
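As a hedged sketch of what that looks like (the model path and parameter values are illustrative placeholders):

```python
from llama_cpp import Llama

# Load the quantised model produced above (path is a placeholder).
llm = Llama(
    model_path="llama-2-7b-chat/ggml-model-q4_0.bin",
    n_ctx=2048,      # context window size
    n_gpu_layers=1,  # offload layers to Metal on Apple silicon
)

output = llm("Q: What is a quantised model? A:", max_tokens=128, stop=["Q:"])
print(output["choices"][0]["text"])
```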
See #66 for notebooks showing how to run Llama 2 with llama_index.
Nice work @rwood-97! Looks great!
I'm just going to bump up the llama-index and llama-cpp-python versions in #66 so that we start using the new model format, gguf, rather than ggml (some info here about the difference between them). It looks like llama.cpp (and llama-cpp-python) will only be supporting gguf in the future, so it would be good to use the latest now and make the change.
I think the main change we need to make now is to convert any ggml files we are currently using (e.g. any that we use from TheBloke, or any we create using llama.cpp). We can use the convert-llama-ggmlv3-to-gguf.py script from the llama.cpp repo. From there, I hope it's relatively straightforward to change the notebook examples we have here to work with gguf files.
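For reference, the conversion looks something like the following; the filenames are examples and the flag names are assumptions, so check the script's --help for the exact interface.

```bash
# Convert an existing ggml v3 file to gguf (filenames are examples,
# flag names assumed; see: python convert-llama-ggmlv3-to-gguf.py --help).
python convert-llama-ggmlv3-to-gguf.py \
    --input llama-2-7b-chat.ggmlv3.q4_0.bin \
    --output llama-2-7b-chat.q4_0.gguf
```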
I've just merged my branch, which uses the latest llama-index and llama-cpp-python, into yours.
We've figured out how to use llama-cpp-python with llama-index to work with quantized versions of Meta's Llama chat models. This led to a contribution to the llama-index documentation on using LlamaCPP: https://github.com/jerryjliu/llama_index/pull/7616
They work pretty well and it is possible to run them locally on our machines: a 6-bit version of Llama-2-7B-Chat-GGUF requires around 8GB of memory, whilst a 6-bit version of Llama-2-13B-chat-GGUF requires about 13GB. The largest model, Llama-2-70B-chat-GGUF, requires about 60GB for 6-bit and 75GB for 8-bit.
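A minimal sketch of the LlamaCPP usage in llama-index (the model path and parameter values are illustrative placeholders, not the exact settings from the notebooks):

```python
from llama_index.llms import LlamaCPP

# Path and parameter values are placeholders, not the settings from #66.
llm = LlamaCPP(
    model_path="./models/llama-2-7b-chat.Q6_K.gguf",
    temperature=0.1,
    max_new_tokens=256,
    context_window=2048,
    model_kwargs={"n_gpu_layers": 1},  # offload to Metal on a Mac
    verbose=True,
)

response = llm.complete("What is a quantised model?")
print(response.text)
```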
The llama-index model currently uses GPT-3.5 via the OpenAI API. We would like to replace this with an open-source model, like Falcon or Llama 2.