alan-turing-institute / reginald

Reginald repository for REG Hack Week 23

Replace LLM with an open-source model in llama-index #65

Closed · rchan26 closed this 1 year ago

rchan26 commented 1 year ago

The llama-index model currently uses GPT-3.5 (ChatGPT) via the OpenAI API. We would like to replace this with an open-source model, such as Falcon or Llama 2.

rwood-97 commented 1 year ago

To get llama2 working on a Mac you need a quantised model, e.g. ggml-model-q4_0.bin.

Two options are to get one directly from Hugging Face from TheBloke (note: you need the q4_0 one), or to get the weights from Meta and quantise them yourself using llama.cpp.
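If you go the TheBloke route, something like the following should fetch a GGML file programmatically. This is a minimal sketch: the repo_id and filename are illustrative, so check the model card on Hugging Face for the exact q4_0 file name.

    from huggingface_hub import hf_hub_download

    # Illustrative repo/file names - check TheBloke's model card for the exact q4_0 filename
    model_path = hf_hub_download(
        repo_id="TheBloke/Llama-2-7B-Chat-GGML",
        filename="llama-2-7b-chat.ggmlv3.q4_0.bin",
    )
    print(model_path)  # local path to the downloaded quantised model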

To quantize yourself:

  1. Go to Meta's Llama download page and request a download link.
  2. Download the model you want (e.g. llama-2-7b-chat).
  3. Build llama.cpp and convert/quantise the model:
    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp
    git checkout b5ffb28  # <-- older commit (the one I used) which still converts to .bin (GGML) files
    LLAMA_METAL=1 make  # build with Metal (Apple GPU) support
    python convert-pth-to-ggml.py llama-2-7b-chat/ 1  # convert the PyTorch weights to a 16-bit GGML file
    ./quantize llama-2-7b-chat/ggml-model-f16.bin llama-2-7b-chat/ggml-model-q4_0.bin q4_0  # 4-bit quantisation

Then you should have a ggml-model-q4_0.bin file which you can use on your Mac.

rwood-97 commented 1 year ago

There is also llama-cpp-python, which is a Python binding for llama.cpp, so you can run the model from a Python script.
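For example, a minimal sketch of loading the quantised model from Python (the model path is a placeholder for wherever your ggml-model-q4_0.bin ended up):

    from llama_cpp import Llama

    # Load the quantised model produced above (path is a placeholder)
    llm = Llama(model_path="llama-2-7b-chat/ggml-model-q4_0.bin", n_ctx=2048)

    output = llm(
        "Q: Name the planets in the solar system. A: ",
        max_tokens=128,
        stop=["Q:", "\n"],
        echo=True,
    )
    print(output["choices"][0]["text"])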

rwood-97 commented 1 year ago

See #66 for notebooks showing how to run llama2 + llama_index.

rchan26 commented 1 year ago

Nice work @rwood-97! Looks great!

I'm just going to bump up the llama-index and llama-cpp-python versions in #66 so that we start using the new model format gguf rather than ggml (some info here about the difference between them). It looks like llama.cpp (and llama-cpp-python) will only be supporting gguf in the future, so it would be good to switch to the latest versions now and make the change.

I think the main change we need to make now is to convert any ggml files we are currently using (e.g. any that we use from TheBloke, or any we create using llama.cpp). We can use the convert-llama-ggmlv3-to-gguf.py script from the llama.cpp repo. From there, I hope it's relatively straightforward to update the notebook examples we have here to work with gguf files.
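As a rough sketch, the conversion could be scripted like this. The paths are placeholders and the --input/--output flags are assumptions, so check the script's --help for the llama.cpp version you have checked out.

    import subprocess

    # Convert an existing GGMLv3 file to GGUF using the llama.cpp helper script.
    # Paths and flags are assumptions - verify with `python convert-llama-ggmlv3-to-gguf.py --help`.
    subprocess.run(
        [
            "python",
            "llama.cpp/convert-llama-ggmlv3-to-gguf.py",
            "--input", "llama-2-7b-chat/ggml-model-q4_0.bin",
            "--output", "llama-2-7b-chat/llama-2-7b-chat.q4_0.gguf",
        ],
        check=True,
    )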

I've just merged my branch, which uses the latest llama-index and llama-cpp-python, into yours.

rchan26 commented 1 year ago

We've figured out how to use llama-cpp-python with llama-index to work with quantized versions of Meta's Llama chat models. This led to a contribution to the llama-index documentation on using LlamaCPP: https://github.com/jerryjliu/llama_index/pull/7616

They work pretty well and are possible to run locally on our machines: a 6-bit version of Llama-2-7b-Chat-GGUF requires around 8GB, whilst a 6-bit version of Llama-2-13B-chat-GGUF requires about 13GB. The largest Llama-2-70B-chat-GGUF model requires about 60GB for 6-bit and 75GB for 8-bit.
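For reference, a minimal sketch of the LlamaCPP setup along the lines of the documentation linked above. The model path is a placeholder for a local GGUF file (e.g. one from TheBloke), and parameters like context_window and n_gpu_layers will depend on your machine.

    from llama_index.llms import LlamaCPP
    from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

    # Model path is a placeholder for a local GGUF file, e.g. a 6-bit quantisation from TheBloke
    llm = LlamaCPP(
        model_path="llama-2-7b-chat.Q6_K.gguf",
        temperature=0.1,
        max_new_tokens=256,
        context_window=3900,
        model_kwargs={"n_gpu_layers": 1},  # offload layers to Metal on a Mac
        messages_to_prompt=messages_to_prompt,
        completion_to_prompt=completion_to_prompt,
        verbose=True,
    )

    response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
    print(response.text)

From there the llm object can be passed into llama-index (e.g. via a ServiceContext) as in the notebooks in #66.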