ggerganov opened this issue 4 months ago
FWIW the HF cache layout is quite nice and it's git-aware:
@lysandrejik and I implemented it a while ago and it's been working well.
For instance this is the layout for one given model repo with two revisions/two files inside of it:
```
[  96]  .
└── [ 160]  models--julien-c--EsperBERTo-small
    ├── [ 160]  blobs
    │   ├── [321M]  403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
    │   ├── [ 398]  7cb18dc9bafbfcf74629a4b760af1b160957a83e
    │   └── [1.4K]  d7edf6bd2a681fb0175f7735299831ee1b22b812
    ├── [  96]  refs
    │   └── [  40]  main
    └── [ 128]  snapshots
        ├── [ 128]  2439f60ef33a0d46d85da5001d52aeda5b00ce9f
        │   ├── [  52]  README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
        │   └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
        └── [ 128]  bbc77c8132af1cc5cf678da3f1ddf2de43606d48
            ├── [  52]  README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e
            └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
```
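To make the layout concrete, here is a minimal C++ sketch of how a consumer could resolve a file in it; the helper name is hypothetical, and it only assumes what the tree above shows (`refs/<rev>` holds a commit hash, and snapshot entries are symlinks into `blobs/`):

```cpp
// Hypothetical sketch: read the commit hash from refs/<rev>, then
// build the snapshot path; the snapshot entry is a symlink, so any
// subsequent open() transparently hits the content-addressed blob.
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

static fs::path resolve_snapshot_file(const fs::path    & repo_dir, // e.g. "models--julien-c--EsperBERTo-small"
                                      const std::string & rev,      // e.g. "main"
                                      const std::string & filename) // e.g. "pytorch_model.bin"
{
    // refs/<rev> contains the commit hash of that revision's snapshot
    std::ifstream ref(repo_dir / "refs" / rev);
    std::string commit;
    ref >> commit;

    return repo_dir / "snapshots" / commit / filename;
}
```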
Probably we can take advantage of the Hub API. For example, to list all files in a repo: https://huggingface.co/api/models/meta-llama/Meta-Llama-3-8B/tree/main

This could potentially remove the need for `--hf-file` and etag checking.
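As a rough illustration of consuming that endpoint from C++ (llama.cpp already links against libcurl for downloads and ships a vendored nlohmann/json), a sketch like the following could list a repo's files. The `"path"` field name matches the JSON the endpoint returns; everything else is an assumption, and a gated repo like this one would additionally need an auth token:

```cpp
// Rough sketch (not a proposed implementation): list the files of a HF
// repo by fetching the tree endpoint above with libcurl and parsing the
// JSON array it returns. Error handling is mostly omitted.
#include <curl/curl.h>
#include <nlohmann/json.hpp>
#include <iostream>
#include <string>

static size_t write_cb(char * ptr, size_t size, size_t nmemb, void * userdata) {
    static_cast<std::string *>(userdata)->append(ptr, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    std::string body;

    curl_easy_setopt(curl, CURLOPT_URL,
        "https://huggingface.co/api/models/meta-llama/Meta-Llama-3-8B/tree/main");
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

    if (curl_easy_perform(curl) == CURLE_OK) {
        // each entry of the returned array has a "path" field
        for (const auto & entry : nlohmann::json::parse(body)) {
            std::cout << entry["path"].get<std::string>() << "\n";
        }
    }

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```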
Hi, this is my first contribution to this project.

I made a PR with a basic implementation of the cache mechanism. The downloaded files are stored in the directory specified by the `LLAMA_CACHE` env variable. If the env variable is not provided, the models are stored in the default cache directory: `.cache/`.

Let me know if I'm going in the right direction.
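For reference, a minimal sketch of the described fallback (the helper name `get_cache_dir` is hypothetical; `LLAMA_CACHE` and the `.cache/` default come from the PR description above):

```cpp
// Minimal sketch of the env-var fallback; the helper name is
// hypothetical, LLAMA_CACHE and ".cache/" come from the PR.
#include <cstdlib>
#include <string>

static std::string get_cache_dir() {
    // honor an explicit LLAMA_CACHE location if the user set one
    if (const char * env = std::getenv("LLAMA_CACHE")) {
        return env;
    }
    // otherwise fall back to the default cache directory
    return ".cache/";
}
```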
@amirzia I think the proposed changes are good - pretty much what I imagined as a first step.

I'm not sure what the benefits of having a git-aware cache similar to HF's would be, but if we think there are reasonable advantages, we can work on that to improve the functionality further. Maybe for now it's fine to merge the PR as it is.
organic community demand for a shared cache between all local ML apps: https://x.com/filipviz/status/1792981186446274625
Should we agree on a common standard (layout and path)?
There is already this proposal for a standard path: https://filip.world/post/modelpath/. We also have the HF git-aware layout (which Julien seems to really like 😄).
Although I'm not sure if llama.cpp and other applications benefit from having the history of models.
Ah I see now. The shared location seems reasonable in order to have different apps share the same model data.

> Although I'm not sure if llama.cpp and other applications benefit from having the history of models.

I also don't think that `llama.cpp` has use cases for the git-aware structure, and it might not be trivial to implement in C++. Filesystem operations are a real pain in C++.
We've recently introduced the `--hf-repo` and `--hf-file` helper args to `common` in https://github.com/ggerganov/llama.cpp/pull/6234.

Currently, the files downloaded via `curl` are stored in a destination based on the `--model` CLI arg. If `--model` is not provided, we would like to auto-store the downloaded model files in a local cache, similar to what other frameworks like HF/transformers do.

Here is the documentation of this functionality in HF for convenience and reference:
URL: https://huggingface.co/docs/transformers/installation?highlight=transformers_cache#cache-setup
The goal of this issue is to implement similar functionality in `llama.cpp`. The environment variables should be named according to the `llama.cpp` patterns, and the local cache should be utilized only when the `--model` CLI argument is not explicitly provided in commands like `main` and `server`.
P.S. I'm interested in exercising "Copilot Workspace" to see if it would be capable of implementing this task by itself.
P.S.2 So CW is quite useless at this point for `llama.cpp` - it cannot handle files with a few thousand lines of code.

CW snapshot: https://copilot-workspace.githubnext.com/ggerganov/llama.cpp/issues/7252?shareId=379fdaa0-3580-46ba-be68-cb061518a38c