**Is your feature request related to a problem? Please describe.**
Even with a model already downloaded, the package still makes a call to the HF Hub, which increases load time.
From a quick scan of the logic here, it seems that the code just wants to check that the filename provided is in the repo provided.
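For illustration only (this is a hypothetical sketch of the pattern, not the actual llama-cpp-python code): a pre-check along these lines forces a Hub API round-trip even when the file is already in the local cache, which is where the extra latency comes from.

```python
# Hypothetical sketch of an "is this file in the repo?" pre-check.
# list_repo_files always hits the Hub API, even with a primed cache.
from huggingface_hub import HfApi

def file_exists_in_repo(repo_id: str, filename: str) -> bool:
    files = HfApi().list_repo_files(repo_id)
    return filename in files
```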
**Describe the solution you'd like**
If you skipped that check, assumed the file existed, and called `hf_hub_download` directly (see the sketch below), that function would handle the error itself if it couldn't find the file in the given repo. The error may not be quite as focused, but init would run in a third of the time.
On my machine:
- loading from cache takes 400ms
- loading from cache with this additional check of available files in the repo takes 1,200ms
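A minimal sketch of what I mean, assuming the helper name and error handling are free to change: let `hf_hub_download` do the resolution and surface its own error when the filename isn't in the repo.

```python
# Minimal sketch: skip the repo listing and let hf_hub_download resolve the
# file. It returns the cached path when possible and raises if the filename
# doesn't exist in the repo.
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import EntryNotFoundError

def resolve_model_path(repo_id: str, filename: str) -> str:  # hypothetical helper
    try:
        return hf_hub_download(repo_id=repo_id, filename=filename)
    except EntryNotFoundError as e:
        # Less specific than a curated message, but still actionable.
        raise ValueError(f"{filename!r} not found in repo {repo_id!r}") from e
```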
**Describe alternatives you've considered**
The workaround (if I want to do it all in Python) is to use `from_pretrained` to download the appropriate file, then get the cached file location and pass that as `model_path` to `Llama` without going through `from_pretrained`, as sketched below.
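Roughly what that workaround looks like (the repo id and filename below are just placeholders for whatever model you're using):

```python
# Workaround sketch: resolve the cached GGUF with hf_hub_download, then pass
# the local path straight to Llama instead of using Llama.from_pretrained.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",  # placeholder repo
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",   # placeholder filename
)
llm = Llama(model_path=model_path)
```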
**Additional context**
For work with HF models, I have `HF_HUB_OFFLINE=1` set by default and only turn it off when I need a new model (a few HF operations like to make checks for model info that require network requests, even with the cache primed). It would be great if this were compatible with llama-cpp-python.
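For reference, this is the offline setup I mean; as I understand it, the env var has to be set before huggingface_hub is imported, and cached files are then used without any network requests (repo/filename are placeholders again):

```python
# Offline workflow sketch: enable HF offline mode before importing
# huggingface_hub, so downloads resolve from the local cache only.
import os
os.environ["HF_HUB_OFFLINE"] = "1"

from huggingface_hub import hf_hub_download

# Returns the cached path; raises if the file has never been downloaded.
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",  # placeholder repo
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
)
```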
Side note: I just started using this today and was delighted with how easy it was to install, with CUDA support, from a single pip command. Nice work.