load_model should be fast compared to the download, since it does not compile anything, and it helps validate that you downloaded the right artifact. And I believe safetensors, which we want to make the default, loads parameters lazily, so it should be even less work.
Maybe the right idea, but this makes it kind of rough if you just want to load/upload the model on a machine that isn't set up for inference. Or at least with the times I'm seeing.
Example: https://huggingface.co/google-bert/bert-base-cased
Download, clocked by counting out loud while the progress bars were going: ~14 seconds
Total load_model execution time: 76 seconds
Tested with:
```elixir
# :timer.tc returns {microseconds, result}; take the time and convert to ms
:timer.tc(fn -> Bumblebee.load_model({:hf, "google-bert/bert-base-cased"}) end)
|> elem(0)
|> then(&(&1 / 1000))
|> IO.inspect(label: "ms")
```
No configuration done at all.
Ideally I'd love to stream the download from Hugging Face straight to an S3-compatible store, but that is outside the scope of what Bumblebee is about.
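Something like this rough sketch is what I have in mind, assuming Req and ExAws.S3 as dependencies (the bucket and key names here are made up):

```elixir
# Rough sketch: fetch one repo file from Hugging Face and stream it on to S3.
# Assumes {:req, "~> 0.4"} and {:ex_aws_s3, "~> 2.0"} are configured;
# the bucket and key names are placeholders.
url = "https://huggingface.co/google-bert/bert-base-cased/resolve/main/model.safetensors"
tmp = Path.join(System.tmp_dir!(), "model.safetensors")

# Stream the HTTP response body into a temp file instead of holding it in memory.
Req.get!(url, into: File.stream!(tmp))

# Multipart-upload the file to S3 in 5 MB chunks.
tmp
|> File.stream!([], 5 * 1024 * 1024)
|> ExAws.S3.upload("my-model-bucket", "bert-base-cased/model.safetensors")
|> ExAws.request!()
```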
You can take the files from the HF repository and put them in S3 or wherever; then, when you download them onto the local machine, use {:local, path_to_repo_dir}
(just make sure you don't copy parameter files in multiple formats, as that would be unnecessary).
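For example, if the repository files were mirrored to S3, pulling them back and loading could look roughly like this (assuming ExAws.S3; the bucket, keys, and directory are placeholders):

```elixir
# Sketch: fetch previously mirrored repo files from S3 into a local directory,
# then point Bumblebee at that directory.
dir = "/tmp/bert-base-cased"
File.mkdir_p!(dir)

for file <- ["config.json", "tokenizer.json", "model.safetensors"] do
  ExAws.S3.download_file("my-model-bucket", "bert-base-cased/#{file}", Path.join(dir, file))
  |> ExAws.request!()
end

# No HTTP call to Hugging Face happens here; everything is read from disk.
{:ok, model_info} = Bumblebee.load_model({:local, dir})
```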
In the future we may have our own serialisation format for things, but I don't think we should be exposing the download of hf/transformers files.
> 76 seconds
You'd need to use EXLA.Backend, because there are some transformations that are going to be slow otherwise.
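Concretely, that means setting EXLA as the default Nx backend before loading (this is the standard Nx call, nothing Bumblebee-specific):

```elixir
# Make EXLA the default backend so the tensor transformations done while
# loading run on a compiled backend instead of the pure-Elixir Nx.BinaryBackend.
Nx.global_default_backend(EXLA.Backend)

{:ok, model_info} = Bumblebee.load_model({:hf, "google-bert/bert-base-cased"})
```

In an application this would usually live in config.exs instead, as `config :nx, default_backend: EXLA.Backend`.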
That was a lot faster. I can make do.
Hey
I was looking to set up something where I load models from a nearby S3 bucket, or even use S3 as a pass-through cache for models. And I realized there are no public functions for triggering just the download, so when I try to download a model I also have to load it, which in some cases takes more time than the download and is entirely unnecessary if the goal is just to put it somewhere else :D
I can use private APIs to make some progress, but I'm essentially re-implementing stuff that's already there.
If most of load_model could be broken out and made available as a download_model (probably the same with the other load_X functions), then it would be fairly easy to add options for people who don't want to hassle HuggingFace too much.
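Something with a shape like this, purely hypothetical (download_model does not exist in Bumblebee today):

```elixir
# Hypothetical API sketching the feature request; this function does not
# exist in Bumblebee, and the cache paths shown are made up.
{:ok, paths} = Bumblebee.download_model({:hf, "google-bert/bert-base-cased"})
# => {:ok, ["~/.cache/bumblebee/.../config.json",
#           "~/.cache/bumblebee/.../model.safetensors"]}

# The cached files could then be copied to S3 (or any store) without ever
# building the Axon model or loading the parameters.
```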