Open gabe-l-hart opened 2 weeks ago
I was waiting for this. Thanks a lot for your hard work mate @gabe-l-hart
Thanks for doing this @gabe-l-hart. And thanks for the link @DK013. I appreciate you both!
-Brad
I did my own llamafile build with this branch and was able to use IBM Granite 3.0 8B Instruct. Thank you again @gabe-l-hart!
Hi @jart! I wanted to check in and see if this PR is something you would consider for upstream merging. I see that you use `llama.cpp/README.llamafile` to track the version of `llama.cpp` being used and the list of local modifications on top. I didn't see a clean way to re-bump the commit and apply those deltas, but I'd be happy to re-do this change set to be a full `llama.cpp` bump if that's preferred.
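In case it helps frame that option, here is a rough sketch of what I mean by a full bump. The clone location and the copy step are placeholders on my part, and re-applying the local modifications listed in `README.llamafile` would still be a manual pass:

```sh
# Sketch only: the paths and the copy step are illustrative, not the actual layout.
NEW_COMMIT="<upstream llama.cpp commit to sync to>"

# Fetch upstream llama.cpp at the target commit
git clone https://github.com/ggerganov/llama.cpp /tmp/llama.cpp-upstream
git -C /tmp/llama.cpp-upstream checkout "$NEW_COMMIT"

# Copy the upstream sources over the vendored llama.cpp/ directory, re-apply
# each local modification listed in llama.cpp/README.llamafile by hand, then
# record the new commit hash in that file.
```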
Description
This PR adds support for the `"granite"` and `"granitemoe"` architectures in order to support IBM's Granite 3.0. The changes mirror those added in `llama.cpp` upstream:

- `"granite"`: https://github.com/ggerganov/llama.cpp/pull/9412
- `"granitemoe"`: https://github.com/ggerganov/llama.cpp/pull/9438

These models are currently available via HuggingFace and Ollama:

- `granite3-dense` (`"granite"`): https://ollama.com/library/granite3-dense
- `granite3-moe` (`"granitemoe"`): https://ollama.com/library/granite3-moe
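A quick way to confirm which architecture a given GGUF declares (and therefore which of the two new code paths it exercises) is to dump its metadata. This assumes the Python `gguf` package's `gguf-dump` script and an example model path:

```sh
# Dense Granite 3.0 GGUFs should report "granite" here, the MoE ones "granitemoe".
# (Requires `pip install gguf`; the path below is just an example.)
gguf-dump ~/models/granite-3.0-2b-instruct.Q4_K_M.gguf | grep general.architecture
```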
Testing
I did my development on a Mac M3 without `gmake` natively installed. To avoid a system-level install, I wrapped my dev environment in `docker` with the following two scripts:

`build_dockerized.sh`

```sh
#!/usr/bin/env bash
cd $(dirname ${BASH_SOURCE[0]})
docker buildx build . -t llamafile-builder:latest --load
docker run --rm -it --entrypoint bash -w /src -v $PWD:/src -v $HOME/models:/models llamafile-builder:latest
```

`build_in_docker.sh`

```sh
#!/usr/bin/env bash
gguf_file=$1
if [ $# -ge 2 ]
then
    model_name=$2
else
    model_name=$(basename $gguf_file | cut -d'.' -f 1)
fi
echo "Model Name: $model_name"

# Build (NOTE: First build may fail due to the need to download tools)
make -j || make -j

# Install the built binaries
make install PREFIX=/usr/local

# Make a temp dir to work in
start_dir=$PWD
temp_dir=$(mktemp -d)
cd $temp_dir

# Copy over the model and base binary
echo "Copying source materials..."
cp $gguf_file .
cp $(which llamafile) $model_name.llamafile

# Make the .args file
echo "Making .args file..."
echo "-m $(basename $gguf_file) --host 0.0.0.0 -ngl 9999 ..." > .args

# Pack it all together
echo "Packing with zipalign..."
zipalign -j0 $model_name.llamafile $(basename $gguf_file) .args

# Move it back to the root dir
mv $model_name.llamafile $start_dir/
echo "DONE"
```

With these scripts, my workflow was:
1. `ollama pull` the model (then grab the `$HOME/.ollama/models/blobs/...` blob for the GGUF file)
2. Launch the dockerized build environment (`./build_dockerized.sh`)
3. Build the `llamafile` inside (`./build_in_docker.sh /models/granite-3.0-2b-instruct.Q4_K_M.gguf granite3-dense-2b`)
4. Run the `llamafile` outside the docker shell (`./granite3-dense-2b.llamafile -p "tell me a story"`)
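As a quick sanity check after step 3 (not something the scripts do themselves), the packed output is an ordinary zip archive, so you can confirm the GGUF and the `.args` file made it in:

```sh
# Run from the directory build_in_docker.sh moved the artifact back to;
# the GGUF and .args entries should both show up in the zip listing.
unzip -l granite3-dense-2b.llamafile
```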
Open Questions
Solved! I found the PR added after mine in `llama.cpp` to update the chat template to support `"granite"`: https://github.com/ggerganov/llama.cpp/pull/10013

Original question: when running in interactive mode, the chat template seems to use different special tokens than those defined in the `chat_template` metadata in the GGUF file. I haven't dug in enough yet to understand whether this is something that can be pulled automatically from the GGUF, or whether there's an additional place where the Granite architectures will need to explicitly indicate their chat templates.
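For anyone who wants to poke at this, the template and special-token metadata the GGUF actually carries can be inspected the same way as above (again assuming the Python `gguf` package's `gguf-dump` script and an example model path):

```sh
# Shows the embedded chat template plus the BOS/EOS token ids the file declares.
gguf-dump ~/models/granite-3.0-2b-instruct.Q4_K_M.gguf \
  | grep -E 'tokenizer.chat_template|tokenizer.ggml.(bos|eos)_token_id'
```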