elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)

Issues loading tokenizer/Support loading tokenizer.model? #239

Closed: bianchidotdev closed 1 year ago

bianchidotdev commented 1 year ago

I'm having issues loading certain models from Hugging Face, which may largely be an issue with those repos rather than with Bumblebee.

What I'm seeing:

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openlm-research/open_llama_3b_v2"})

** (MatchError) no match of right hand side value: {:error, "file not found"}

It looks like it's failing to find a tokenizer.json. Unfortunately, the Hugging Face repo ships only a tokenizer.model and related config files, not a tokenizer.json, and it appears quite a few models on Hugging Face follow suit.

I'm not sure what the effort would be to support loading tokenizer.model directly, or whether there are other ways around this.

jonatanklosko commented 1 year ago

Usually we look for a different repository that uses the same tokenizer and does ship a tokenizer.json. In this case you can try yhyhy3/open_llama_7b_v2_med_instruct, which is a fine-tuned version of the original repo and likely uses the same tokenizer.
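For example (a sketch, assuming that fine-tuned repo does ship a tokenizer.json and that its tokenizer matches the original):

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "yhyhy3/open_llama_7b_v2_med_instruct"})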

According to this paragraph, the "fast tokenizer" (dumped to/loaded from tokenizer.json) used to give wrong results, but this seems to have been resolved in https://github.com/huggingface/transformers/issues/24233.

We can send a PR to the hf repo with the tokenizer file, which we did for a couple of repos in the past, so I will keep this open :)

bianchidotdev commented 1 year ago

Thanks a lot! I was looking pretty hard for a 1B or 3B model to test with on my laptop, since I don't really have the memory needed to run a 7B+ model, but that makes sense.

For my own reference and usage, is generating a tokenizer.json as simple as the following?

# python
from transformers import AutoTokenizer

# Fetches the vocab/config/merges files and, when possible, converts
# the resulting "slow" tokenizer into a "fast" one
tokenizer = AutoTokenizer.from_pretrained("<model>")

# For a fast tokenizer this dumps tokenizer.json into the target directory
tokenizer.save_pretrained("<dir_to_save_to>")

Do you have any sense of how much work it would be to handle the model file natively in Elixir?

jonatanklosko commented 1 year ago

@bianchidotdev this is precisely it! When you call AutoTokenizer.from_pretrained it fetches the vocab/config/merges files and creates a "slow tokenizer", then transformers attempts to convert it to a fast tokenizer if possible. If the conversion works, the tokenizer is a "fast tokenizer" and save_pretrained dumps it into tokenizer.json, which is the file we rely on.
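As a side note, once the dump is on disk you should also be able to load it straight from that directory using Bumblebee's {:local, directory} repository tuple (a sketch reusing the placeholder directory from the snippet above):

{:ok, tokenizer} = Bumblebee.load_tokenizer({:local, "<dir_to_save_to>"})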

If you want to open a PR on the HF repos, here's an example; just make sure you have the latest transformers installed locally before doing the conversion. No pressure though, I can also do it later :)

jonatanklosko commented 1 year ago

@bianchidotdev I opened a PR while testing a new conversion tool and I noticed you opened one already, thanks!

FTR you don't have to wait for the PR to be merged; you can just reference the PR commit directly:

{:ok, tokenizer} =
  Bumblebee.load_tokenizer(
    {:hf, "openlm-research/open_llama_3b_v2",
     revision: "52944fc4e35e6ca00e733b95df79498728016e1d"}
  )

jonatanklosko commented 1 year ago

Also, I improved the error messages in #256, so it will be clear why the tokenizer cannot be loaded. And we have a new section in the README with actions the user may take :)