elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)
Apache License 2.0

"File not found" issue when trying to load spaCy model/tokenizer from HF #250

Closed by darioprencipe 9 months ago

darioprencipe commented 9 months ago

Hello,

I am trying to load a spaCy model from Hugging Face in a Livebook using Bumblebee.

I run the following:

Mix.install(
  [
    {:bumblebee, "~> 0.4.0"},
    {:exla, ">= 0.0.0"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "spacy/en_core_web_lg"})

And I get the following error:

** (MatchError) no match of right hand side value: {:error, "file not found"}
    (stdlib 5.0.2) erl_eval.erl:498: :erl_eval.expr/6
    #cell:4xx2e5lot56trpxrg2aqq4xiop2v5pbp:1: (file)

From this error message it's pretty difficult for me to understand what is going wrong.

Thanks! Dario

josevalim commented 9 months ago

We expect some files to exist when we download a repository and it is likely that the repository above does not have the relevant tokenizer configuration. I agree with you though, the error message should be clearer. :)
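
Until the message itself improves, one way to avoid the bare `MatchError` is to pattern match on the result and attach the repository to the error. The following is only a sketch: `TokenizerHelper` and `explain/2` are hypothetical names that are not part of Bumblebee, and the hint text is an assumption based on the explanation above.

```elixir
defmodule TokenizerHelper do
  # Hypothetical helper, not part of Bumblebee: turns the result of
  # Bumblebee.load_tokenizer/1 into a more actionable error tuple.
  def explain({:ok, tokenizer}, _repo), do: {:ok, tokenizer}

  def explain({:error, reason}, repo) do
    {:error,
     "could not load a tokenizer from #{inspect(repo)}: #{inspect(reason)}. " <>
       "The repository may not contain the tokenizer files Bumblebee looks for."}
  end
end

# Usage in the Livebook cell from the report
# (requires Bumblebee to be installed and network access):
#
#     repo = {:hf, "spacy/en_core_web_lg"}
#
#     case TokenizerHelper.explain(Bumblebee.load_tokenizer(repo), repo) do
#       {:ok, tokenizer} -> tokenizer
#       {:error, message} -> IO.puts(message)
#     end
```

Using `case` instead of `{:ok, tokenizer} = ...` keeps the cell from crashing and leaves room to print which repository failed.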

darioprencipe commented 9 months ago

> We expect some files to exist when we download a repository and it is likely that the repository above does not have the relevant tokenizer configuration. I agree with you though, the error message should be clearer. :)

Thanks for getting back to me. Is there further documentation on this? In general, what structure must an HF model follow to be importable and usable in Bumblebee?

I can push my own spaCy model to HF, so if I knew which files are expected and what they should look like, I could try to work around the issue, unless I'm missing something else.

jonatanklosko commented 9 months ago

Bumblebee expects models that are compatible with the huggingface/transformers library. Unfortunately, spaCy is an entirely different library, with its own storage format and design, so I don't think we will be able to support it anytime soon (it may be more fitting as a separate NLP-specific library).
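
For comparison, a repository published in the transformers format loads through the same call. This is a minimal sketch under stated assumptions: the `google-bert/bert-base-uncased` repository is chosen here only as an example of a transformers-compatible model and is not mentioned in the thread, `TransformersTokenizerDemo` is a hypothetical wrapper name, and running it requires Bumblebee installed as in the original report plus network access.

```elixir
defmodule TransformersTokenizerDemo do
  # Hypothetical wrapper module, for illustration only.
  def run do
    # Same call as in the report, but pointing at a repository
    # stored in the huggingface/transformers format (assumed example).
    {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "google-bert/bert-base-uncased"})

    # Tokenizes a batch with one sentence and returns the encoded inputs.
    Bumblebee.apply_tokenizer(tokenizer, ["Hello from Livebook"])
  end
end
```

Calling `TransformersTokenizerDemo.run()` in a Livebook cell should return the tokenized inputs rather than `{:error, "file not found"}`, provided the download succeeds.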

I agree that we should have a more descriptive and actionable error message, so I will open a separate issue :)