Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Feature Request: Automate update of upstream llama.cpp #486

Closed: maruel closed this issue 2 days ago

maruel commented 2 weeks ago


Feature Description

Keep up with upstream llama.cpp via scripted automation.

Motivation

The last llama.cpp update in llamafile was https://github.com/ggerganov/llama.cpp/commit/152da28ae54139e3754189b9e6e1c28e11277502, which dates from May. There's at least one fix I need from a more recent commit (BF16 on Metal).

There are a fair number of local modifications, as described in llama.cpp/README.llamafile, but it's tricky to recover the exact diff since the changes are intermixed with source updates.

Automating the upstream update will lower toil, since the project will want to keep up with llama.cpp for the foreseeable future. Automating the manual steps will also let non-core contributors help stay up to date with upstream, or the sync could even run automatically via GitHub Actions.

Automation would be especially useful since upstream llama.cpp was refactored significantly in https://github.com/ggerganov/llama.cpp/commit/f3f65429c44bb195a9195bfdc19a30a79709db7b, which makes the update more daunting for a new contributor. :) This does mean that when there's an upstream breaking change like that commit, the script would have to be updated. That's a good thing, since it formally describes the changes in the steps needed to sync with upstream.

Possible Implementation

Option 1

Add a new update_upstream.sh script (or one written in another language like Python) that fetches the pinned upstream commit, copies in the needed sources, and reapplies the local patches; see the sketch below.

Add a second script that generates a new patch file whenever a local modification is made.

Advantage: no need for a git submodule; fairly straightforward.
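
A minimal sketch of what update_upstream.sh might look like, assuming local changes are kept as .patch files under llama.cpp/patches/ and the synced commit is recorded in llama.cpp/UPSTREAM_COMMIT (both names are hypothetical, as is the exact subset of copied files):

```sh
#!/bin/bash
# update_upstream.sh: hypothetical sketch; file layout and names are assumptions.
set -e

UPSTREAM=https://github.com/ggerganov/llama.cpp
COMMIT="$1"  # upstream commit to sync to

# Fetch the upstream tree at the requested commit into a scratch clone.
rm -rf /tmp/llama.cpp-upstream
git clone "$UPSTREAM" /tmp/llama.cpp-upstream
git -C /tmp/llama.cpp-upstream checkout "$COMMIT"

# Copy over only the vendored subset of sources (the exact subset is an assumption).
cp /tmp/llama.cpp-upstream/ggml*.{c,h,m} llama.cpp/

# Reapply each local modification, kept as a patch file.
for p in llama.cpp/patches/*.patch; do
    [ -e "$p" ] || continue
    patch -d llama.cpp -p1 < "$p"
done

# Record which upstream commit the tree is now synced to.
echo "$COMMIT" > llama.cpp/UPSTREAM_COMMIT
```

Invoked as `./update_upstream.sh <commit-sha>`, with the companion second script amounting to a `diff -u` (or `git diff`) between the pristine upstream file and the locally edited copy.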

Option 2

Use a git submodule in llamafile.git plus a proper llama.cpp.git fork, then have the CMake build in llamafile.git select the subset of sources needed.

Advantage: it makes keeping up with upstream's breaking changes easier. Disadvantage: a bit more "infrastructure", as a new git repository is needed, e.g. https://github.com/Mozilla-Ocho/llama.cpp
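
For illustration, assuming such a fork exists at the URL above (it's hypothetical), the day-to-day workflow could be as simple as:

```sh
# One-time setup: vendor the fork as a submodule (fork URL is hypothetical).
git submodule add https://github.com/Mozilla-Ocho/llama.cpp third_party/llama.cpp

# Later, to pick up a newer upstream state that the fork has merged:
git -C third_party/llama.cpp fetch origin
git -C third_party/llama.cpp checkout origin/main
git add third_party/llama.cpp
git commit -m "Sync llama.cpp submodule"
```

Local modifications would then live as regular commits in the fork, so upstream merges resolve conflicts through git itself rather than through hand-maintained patch files.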

Option 3

Use a git submodule pointing at unpatched upstream, then have CMake dynamically apply patches to produce intermediary files.

Disadvantage: It would be a bit more obscure to debug.
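
Roughly, the per-file step that CMake would run at build time might look like this (paths and patch names are illustrative):

```sh
# Per-file build step CMake would invoke (illustrative paths):
mkdir -p build/patched
cp third_party/llama.cpp/ggml-metal.m build/patched/
patch build/patched/ggml-metal.m patches/ggml-metal.m.patch
# The build then compiles build/patched/ggml-metal.m instead of the submodule copy.
```

This is also why debugging gets more obscure: the sources actually compiled no longer match either the submodule or the patch files on disk.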

jart commented 2 weeks ago

I don't think automating the sync is realistically going to happen. Upstream made a lot of changes we can't agree to, such as the CUDA code size growing too large and LLaVA support being removed from the server.

Tell me what changes you need cherry-picked. I'm happy to make that a priority for you, any time. Metal for example should be super easy to merge, since I've never needed to change anything about that. It's exciting to hear it supports BF16 now. Is there anything else you'd like me to focus on?

maruel commented 2 weeks ago

Oh, I'm sorry to hear that. Do I understand correctly that you worry you'd spend more time reverting changes as you merge, so you prefer to cherry-pick just what you need? I can understand the tension. Do you prefer that I file a separate request for each cherry-pick? I'll add them here in case you don't mind; otherwise, just close this issue and I'll file another one.

I struggled with the new Phi-3 in llamafile; I suspect (but am not sure) that this commit and its dependencies would help: https://github.com/ggerganov/llama.cpp/commit/916248af1f3c16abd7408de848e025da095c621c

The commit I was thinking about for Metal / BF16 is https://github.com/ggerganov/llama.cpp/commit/2075a66a96cc1b04eabec7cf4b3051193d6f719e. Right now llamafile asserts with `GGML_ASSERT: $HOME/.llamafile/v/0.8.9/ggml-metal.m:1580: false && "MUL MAT-MAT not implemented"`. That commit enables running on the CPU instead, which is super slow but at least doesn't fail.

There seems to be a fair chunk of assorted improvements too, but the two above are what I worry most about. Thanks!

vlasky commented 2 weeks ago

@jart please cherry-pick ggerganov/llama.cpp@0642b22 to support relative routes for static files and a custom api_url configuration option when running in server mode. I've been waiting on this for a while.

mofosyne commented 2 weeks ago

Is there any way to make it easy to do a partial sync or something? E.g. a cleaner separation between modules so you can sync just part of the code base.

jart commented 2 days ago

Thanks for your patience! I've got the BF16 fix in for you. It'll be rolled out in the release. Whenever you need anything merged please do take the time to file an issue. File as many as you want and we'll triage the merge / cherry-pick requests accordingly! You're also encouraged to join the Mozilla Discord mentioned in README and nag me when you need it sooner. Thanks again and enjoy using llamafile!

jart commented 1 day ago

Also, if anyone has ideas on how we might go about solving this issue properly, by implementing Apple Metal GPU support for BF16, I'm willing to take a crack at it. Getting GGML CUDA to support BF16 at acceptable (but not great) performance was the most trivial thing imaginable, but I'm really scratching my head with Metal, since I lack familiarity with it. Tips are welcome!