ggerganov opened this issue 4 months ago
FWIW the HF cache layout is quite nice and it's git-aware:
@lysandrejik and I implemented it a while ago and it's been working well.
For instance this is the layout for one given model repo with two revisions/two files inside of it:
```
[  96]  .
└── [ 160]  models--julien-c--EsperBERTo-small
    ├── [ 160]  blobs
    │   ├── [321M]  403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
    │   ├── [ 398]  7cb18dc9bafbfcf74629a4b760af1b160957a83e
    │   └── [1.4K]  d7edf6bd2a681fb0175f7735299831ee1b22b812
    ├── [  96]  refs
    │   └── [  40]  main
    └── [ 128]  snapshots
        ├── [ 128]  2439f60ef33a0d46d85da5001d52aeda5b00ce9f
        │   ├── [  52]  README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
        │   └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
        └── [ 128]  bbc77c8132af1cc5cf678da3f1ddf2de43606d48
            ├── [  52]  README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e
            └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
```
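To make the layout concrete, here is a minimal C++ sketch of how a consumer could resolve a file in it; the helper name is hypothetical, and it only assumes what the tree above shows (`refs/<rev>` holds a commit hash, and snapshot entries are symlinks into `blobs/`):

```cpp
// Hypothetical sketch: read the commit hash from refs/<rev>, then
// build the snapshot path; the snapshot entry is a symlink, so any
// subsequent open() transparently hits the content-addressed blob.
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

static fs::path resolve_snapshot_file(const fs::path    & repo_dir, // e.g. "models--julien-c--EsperBERTo-small"
                                      const std::string & rev,      // e.g. "main"
                                      const std::string & filename) // e.g. "pytorch_model.bin"
{
    // refs/<rev> contains the commit hash of that revision's snapshot
    std::ifstream ref(repo_dir / "refs" / rev);
    std::string commit;
    ref >> commit;

    return repo_dir / "snapshots" / commit / filename;
}
```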
Probably we can take advantage of the Hub API. For example, to list all files in a repo: https://huggingface.co/api/models/meta-llama/Meta-Llama-3-8B/tree/main

This could potentially remove the need for `--hf-file` and etag checking.
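As a rough illustration of consuming that endpoint from C++ (llama.cpp already links against libcurl for downloads and ships a vendored nlohmann/json), a sketch like the following could list a repo's files. The `"path"` field name matches the JSON the endpoint returns; everything else is an assumption, and a gated repo like this one would additionally need an auth token:

```cpp
// Rough sketch (not a proposed implementation): list the files of a HF
// repo by fetching the tree endpoint above with libcurl and parsing the
// JSON array it returns. Error handling is mostly omitted.
#include <curl/curl.h>
#include <nlohmann/json.hpp>
#include <iostream>
#include <string>

static size_t write_cb(char * ptr, size_t size, size_t nmemb, void * userdata) {
    static_cast<std::string *>(userdata)->append(ptr, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    std::string body;

    curl_easy_setopt(curl, CURLOPT_URL,
        "https://huggingface.co/api/models/meta-llama/Meta-Llama-3-8B/tree/main");
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

    if (curl_easy_perform(curl) == CURLE_OK) {
        // each entry of the returned array has a "path" field
        for (const auto & entry : nlohmann::json::parse(body)) {
            std::cout << entry["path"].get<std::string>() << "\n";
        }
    }

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```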
Hi, this is my first contribution to this project.

I made a PR with a basic implementation of the cache mechanism. The downloaded files are stored in the directory specified by the `LLAMA_CACHE` env variable. If the env variable is not provided, the models are stored in the default cache directory: `.cache/`.

Let me know if I'm going in the right direction.
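For reference, a minimal sketch of the described fallback (the helper name `get_cache_dir` is hypothetical; `LLAMA_CACHE` and the `.cache/` default come from the PR description above):

```cpp
// Minimal sketch of the env-var fallback; the helper name is
// hypothetical, LLAMA_CACHE and ".cache/" come from the PR.
#include <cstdlib>
#include <string>

static std::string get_cache_dir() {
    // honor an explicit LLAMA_CACHE location if the user set one
    if (const char * env = std::getenv("LLAMA_CACHE")) {
        return env;
    }
    // otherwise fall back to the default cache directory
    return ".cache/";
}
```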
@amirzia I think the proposed changes are good - pretty much what I imagined as a first step.

I'm not sure what the benefits of having a git-aware cache similar to HF's would be, but if we think there are reasonable advantages, we can work on that to improve the functionality further. Maybe for now it's fine to merge the PR as it is.
organic community demand for a shared cache between all local ML apps: https://x.com/filipviz/status/1792981186446274625
Should we agree on a common standard (layout and path)?
There is already this proposal for a standard path: https://filip.world/post/modelpath/. We also have the HF git-aware layout (which Julien seems to really like 😄).
Although I'm not sure if llama.cpp and other applications benefit from having the history of models.
Ah I see now. The shared location seems reasonable in order to have different apps share the same model data.

> Although I'm not sure if llama.cpp and other applications benefit from having the history of models.

I also don't think that `llama.cpp` has use cases for the git-aware structure, and it might not be trivial to implement in C++. Filesystem operations are a real pain in C++.
We've recently introduced the `--hf-repo` and `--hf-file` helper args to `common` in https://github.com/ggerganov/llama.cpp/pull/6234.

Currently, the files downloaded via `curl` are stored in a destination based on the `--model` CLI arg. If `--model` is not provided, we would like to auto-store the downloaded model files in a local cache, similar to what other frameworks like HF/transformers do.

Here is the documentation of this functionality in HF for convenience and reference:
URL: https://huggingface.co/docs/transformers/installation?highlight=transformers_cache#cache-setup
The goal of this issue is to implement similar functionality in `llama.cpp`. The environment variables should be named according to the `llama.cpp` patterns, and the local cache should be utilized only when the `--model` CLI argument is not explicitly provided in commands like `main` and `server`.
P.S. I'm interested in exercising "Copilot Workspace" to see if it would be capable of implementing this task by itself.
P.S.2 So CW is quite useless at this point for `llama.cpp` - it cannot handle files with a few thousand lines of code.

CW snapshot: https://copilot-workspace.githubnext.com/ggerganov/llama.cpp/issues/7252?shareId=379fdaa0-3580-46ba-be68-cb061518a38c