"File not found" issue when trying to load spaCy model/tokenizer from HF

darioprencipe commented 9 months ago

Hello,

I am trying to load a spaCy model hosted on Hugging Face in a Livebook using Bumblebee.

I run the following:

Mix.install(
  [
    {:bumblebee, "~> 0.4.0"},
    {:exla, ">= 0.0.0"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "spacy/en_core_web_lg"})

And I get the following error:

** (MatchError) no match of right hand side value: {:error, "file not found"}
    (stdlib 5.0.2) erl_eval.erl:498: :erl_eval.expr/6
    #cell:4xx2e5lot56trpxrg2aqq4xiop2v5pbp:1: (file)

From this error message, it's pretty difficult for me to tell which of these is the case:

- Bumblebee doesn't support this model (in which case my question is: how do I know whether an HF model is supported by Bumblebee?)
- Bumblebee supports it, but something breaks during the HF import.
- Bumblebee supports it, but there is somehow an issue with how Hugging Face packaged it.

Thanks! Dario

josevalim commented 9 months ago

We expect some files to exist when we download a repository and it is likely that the repository above does not have the relevant tokenizer configuration. I agree with you though, the error message should be clearer. :)
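Until the library's own message improves, the caller can match on the returned tuple instead of asserting `{:ok, tokenizer}`, which avoids the `MatchError` and leaves room for a more actionable hint. The following is only a sketch: `LoadHint` and `explain/2` are hypothetical names, not part of Bumblebee, and the wording of the hint is an assumption based on the explanation above.

```elixir
defmodule LoadHint do
  # Hypothetical helper, NOT part of Bumblebee. It takes the result tuple
  # returned by a call such as Bumblebee.load_tokenizer/1, plus the
  # repository name, and returns a more descriptive error.

  # Success is passed through unchanged.
  def explain({:ok, tokenizer}, _repo), do: {:ok, tokenizer}

  # The bare "file not found" reason reported in this issue.
  def explain({:error, "file not found"}, repo) do
    {:error,
     "could not find the expected tokenizer files in #{repo}; " <>
       "the repository may not be in a format Bumblebee can load"}
  end

  # Any other reason is kept, with the repository name added for context.
  def explain({:error, reason}, repo) do
    {:error, "loading a tokenizer from #{repo} failed: " <> inspect(reason)}
  end
end

# Possible use in a Livebook cell (requires Bumblebee and network access):
#
#   repo = "spacy/en_core_web_lg"
#   {:hf, repo} |> Bumblebee.load_tokenizer() |> LoadHint.explain(repo)
```

Returning a tuple rather than raising keeps the decision with the caller, who can pattern match with `case` and decide whether to stop or try another repository.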

darioprencipe commented 9 months ago

> We expect some files to exist when we download a repository and it is likely that the repository above does not have the relevant tokenizer configuration. I agree with you though, the error message should be clearer. :)

Thanks for getting back to me. Is there further documentation on this? In general, what structure must an HF model repository follow in order to be importable and usable in Bumblebee?

I can push my own spaCy model to HF, so if I knew which files are expected and what they should look like, I could try to work around the issue, unless I'm missing something else.

jonatanklosko commented 9 months ago

Bumblebee expects models that are compatible with the huggingface/transformers library. Unfortunately spaCy is an entirely different library, with its own storage format and design, so I don't think we will be able to support it anytime soon (it may be more fitting as a separate NLP-specific library).

I agree that we should have a more descriptive and actionable error message. I will open a separate issue :)

elixir-nx / bumblebee

"File not found" issue when trying to load spaCy model/tokenizer from HF #250