Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Granite three support #608

gabe-l-hart commented 2 weeks ago

Description

This PR adds support for the "granite" and "granitemoe" architectures so that llamafile can run IBM's Granite 3.0 models. The changes mirror those added upstream in llama.cpp:

These models are currently available via HuggingFace and Ollama:
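
For reference, here is a sketch of how one might pull local copies; the model IDs and tags below are taken from the public Ollama and Hugging Face listings at the time of writing and may change:

```sh
# Pull prequantized GGUF copies via Ollama (tags may change)
ollama pull granite3-dense:2b
ollama pull granite3-moe:1b

# Or download the original weights from Hugging Face for manual conversion/quantization
huggingface-cli download ibm-granite/granite-3.0-2b-instruct
```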

Testing

I did my development on an M3 Mac without gmake installed natively. To avoid a system-level install, I wrapped my dev environment in Docker with the following two scripts:

build_dockerized.sh

```sh
#!/usr/bin/env bash
# Build the builder image, then drop into an interactive shell with the
# source tree and local model directory mounted
cd $(dirname ${BASH_SOURCE[0]})
docker buildx build . -t llamafile-builder:latest --load
docker run --rm -it --entrypoint bash -w /src -v $PWD:/src -v $HOME/models:/models llamafile-builder:latest
```
build_in_docker.sh

```sh
#!/usr/bin/env bash
gguf_file=$1
if [ $# -ge 2 ]
then
    model_name=$2
else
    model_name=$(basename $gguf_file | cut -d'.' -f 1)
fi
echo "Model Name: $model_name"

# Build (NOTE: First build may fail due to the need to download tools)
make -j || make -j

# Install the built binaries
make install PREFIX=/usr/local

# Make a temp dir to work in
start_dir=$PWD
temp_dir=$(mktemp -d)
cd $temp_dir

# Copy over the model and base binary
echo "Copying source materials..."
cp $gguf_file .
cp $(which llamafile) $model_name.llamafile

# Make the .args file
echo "Making .args file..."
echo "-m $(basename $gguf_file) --host 0.0.0.0 -ngl 9999 ..." > .args

# Pack it all together
echo "Packing with zipalign..."
zipalign -j0 $model_name.llamafile $(basename $gguf_file) .args

# Move it back to the root dir
mv $model_name.llamafile $start_dir/

echo "DONE"
```
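
As a quick sanity check on the packed artifact: a llamafile is also a valid ZIP archive, so standard zip tooling can confirm the GGUF and .args made it in. A small sketch, assuming unzip is available (it will warn about the executable bytes preceding the archive):

```sh
# List the files zipalign embedded in the finished llamafile
unzip -l granite3-dense-2b.llamafile

# Print the embedded .args to confirm the default arguments
unzip -p granite3-dense-2b.llamafile .args
```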

With these scripts, my workflow was as follows (a command-level sketch follows the list):

  1. Download pre-quantized versions of the models (e.g. ollama pull, then grab the corresponding $HOME/.ollama/models/blobs/... blob as the GGUF file)
    • NOTE: IBM does not currently host official quantized versions, but many community quantizations are available on HF (dense, moe)
  2. Launch the docker build shell (./build_dockerized.sh)
  3. Build the llamafile inside (./build_in_docker.sh /models/granite-3.0-2b-instruct.Q4_K_M.gguf granite3-dense-2b)
  4. Run the llamafile outside the docker shell (./granite3-dense-2b.llamafile -p "tell me a story")
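
Concretely, the same steps as commands. The blob digest and paths are placeholders, and the layout of Ollama's blob store is an implementation detail that may change:

```sh
# 1. Locate the GGUF weights in Ollama's local store after `ollama pull`
#    (the largest blob for the model is the GGUF file; the digest below is a placeholder)
ls -lhS $HOME/.ollama/models/blobs | head
cp $HOME/.ollama/models/blobs/sha256-<digest> $HOME/models/granite-3.0-2b-instruct.Q4_K_M.gguf

# 2. Enter the dockerized build environment
./build_dockerized.sh

# 3. Inside the container: build llamafile and pack the model
./build_in_docker.sh /models/granite-3.0-2b-instruct.Q4_K_M.gguf granite3-dense-2b

# 4. Back on the host: run the result
./granite3-dense-2b.llamafile -p "tell me a story"
```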

Open Questions

When running in interactive mode, the chat template seems to use different special tokens than those defined in the chat_template metadata in the GGUF file. I haven't dug in enough yet to understand whether this is something that can be pulled automatically from the GGUF, or whether there's an additional place where the Granite architectures need to explicitly declare their chat templates.

Solved! I found the PR added after mine in llama.cpp that updates the chat template handling to support "granite": https://github.com/ggerganov/llama.cpp/pull/10013
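
For anyone debugging similar template issues, here's a quick way to peek at what's actually embedded in the GGUF, plus a hedged example of forcing the named template; the second command assumes llamafile's server passes through llama.cpp's --chat-template option, which I haven't verified:

```sh
# Quick-and-dirty look at the chat template stored in the GGUF metadata
# (the tokenizer.chat_template value sits next to its key in the file)
strings /models/granite-3.0-2b-instruct.Q4_K_M.gguf | grep -A 2 chat_template

# Force the built-in "granite" template added by llama.cpp PR 10013, assuming
# the flag is wired through in llamafile's server
./granite3-dense-2b.llamafile --chat-template granite
```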

DK013 commented 2 weeks ago

I was waiting for this. Thanks a lot for your hard work, mate @gabe-l-hart

BradHutchings commented 2 weeks ago

Thanks for doing this @gabe-l-hart. And thanks for the link @DK013. I appreciate you both!

-Brad

BradHutchings commented 1 week ago

I did my own llamafile build with this branch and was able to use IBM Granite 3.0 8B Instruct. Thank you again @gabe-l-hart!

gabe-l-hart commented 1 week ago

Hi @jart! I wanted to check in and see whether this PR is something you would consider merging upstream. I see that you use llama.cpp/README.llamafile to track the version of llama.cpp being used and the list of local modifications on top of it. I didn't see a clean way to re-bump that commit and re-apply those deltas, but I'd be happy to redo this change set as a full llama.cpp bump if that's preferred.