abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

`Llama.from_pretrained` should work with `HF_HUB_OFFLINE=1` #1801


davidgilbertson commented 1 month ago

**Is your feature request related to a problem? Please describe.**
Even with the model already downloaded, the package makes a network call to the HF Hub, which increases load time.

From a quick scan of the logic here, it seems the code just wants to check that the provided filename exists in the provided repo.

**Describe the solution you'd like**
If that check were skipped and the file simply assumed to exist, calling `hf_hub_download` directly would still handle the error case when the file can't be found in the given repo.

The error message may not be quite as focused, but `__init__` would run in about a third of the time.

On my machine:
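The proposed shortcut might look roughly like the sketch below. This is a hypothetical illustration, not the library's actual code: `hf_hub_offline` and `resolve_gguf` are names invented here, and the exact set of truthy values `huggingface_hub` accepts for `HF_HUB_OFFLINE` is an assumption.

```python
import os

def hf_hub_offline() -> bool:
    # Approximates how huggingface_hub interprets HF_HUB_OFFLINE
    # (the exact set of truthy values is an assumption here).
    return os.environ.get("HF_HUB_OFFLINE", "").upper() in {"1", "ON", "YES", "TRUE"}

def resolve_gguf(repo_id: str, filename: str) -> str:
    # Hypothetical replacement for the repo-listing check: go straight to
    # hf_hub_download and let it raise if the file isn't in the repo or cache.
    from huggingface_hub import hf_hub_download  # deferred import keeps the sketch importable
    return hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        local_files_only=hf_hub_offline(),  # serve from the local cache when offline
    )
```

With `HF_HUB_OFFLINE=1` set, `hf_hub_download(..., local_files_only=True)` resolves entirely from the local cache, so no HTTP request is made at all.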

**Describe alternatives you've considered**
The workaround is to use `from_pretrained` once to download the appropriate file (if I want to do it all in Python), then get the cached file location and pass it as `model_path` to `Llama` directly, bypassing `from_pretrained` on subsequent runs.
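As a sketch of that workaround (the helper name `load_cached_gguf` is made up for illustration; it assumes the file has already been downloaded once):

```python
def load_cached_gguf(repo_id: str, filename: str):
    """Resolve an already-downloaded GGUF file from the local HF cache and
    pass its path straight to Llama, skipping Llama.from_pretrained entirely."""
    from huggingface_hub import hf_hub_download
    from llama_cpp import Llama

    # local_files_only=True never touches the network: it returns the cached
    # path, or raises LocalEntryNotFoundError if the file was never downloaded.
    model_path = hf_hub_download(repo_id=repo_id, filename=filename,
                                 local_files_only=True)
    return Llama(model_path=model_path)
```

This avoids the Hub round-trip, but it also gives up the nicer error message that `from_pretrained`'s repo check produces when the filename is wrong.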

**Additional context**
For work with HF models I keep `HF_HUB_OFFLINE=1` set by default, only turning it off when I need a new model (a few HF operations make network requests for model info even with the cache primed). It would be great if this were compatible with llama-cpp-python.

Side note: I just started using this today and was delighted by how easy it was to install, with CUDA support, from a single pip command. Nice work.