Closed bianchidotdev closed 1 year ago
Usually we look for a different repository that uses the same tokenizer and has tokenizer.json. In this case you can try yhyhy3/open_llama_7b_v2_med_instruct, which is a fine-tuned version of the original repo and likely uses the same tokenizer.
According to this paragraph the "fast tokenizer" (dumped/loaded from tokenizer.json) used to give wrong results, but this seems to have been resolved in https://github.com/huggingface/transformers/issues/24233.
We can send a PR to the hf repo with the tokenizer file, which we did for a couple repos in the past, so I will keep this open :)
Thanks a lot! I was looking around pretty hard for a 1B or 3B model to test with on my laptop, since I don't really have the memory needed to run a 7B+ model, but that makes sense.
For my own reference and usage, is generating a tokenizer.json as simple as the following:
# python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("<model>")
tokenizer.save_pretrained("<dir_to_save_to>")
Do you have any sense of how much work it would be to handle the model file natively in Elixir?
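For anyone curious what save_pretrained actually writes: tokenizer.json is a plain JSON file. Here's a heavily simplified, hypothetical sketch of its shape (real files produced by transformers contain many more fields, e.g. normalizer, pre_tokenizer, post_processor, decoder):

```python
import json

# Hypothetical, heavily simplified sketch of a tokenizer.json payload;
# real files dumped by save_pretrained are much larger.
minimal_tokenizer = {
    "version": "1.0",
    "added_tokens": [{"id": 0, "content": "<unk>", "special": True}],
    "model": {
        "type": "BPE",
        "vocab": {"<unk>": 0, "he": 1, "llo": 2},
        "merges": ["he llo"],
    },
}

# Round-trip through JSON to confirm it is plain, self-contained data --
# which is why a single tokenizer.json is enough for Bumblebee to load.
dumped = json.dumps(minimal_tokenizer)
loaded = json.loads(dumped)
print(loaded["model"]["type"])  # prints "BPE"
```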
@bianchidotdev this is precisely it! When you call AutoTokenizer.from_pretrained it will fetch the vocab/config/merges files and create a "slow tokenizer", then it attempts to convert it to a fast tokenizer if possible. If the conversion works, then tokenizer is a "fast tokenizer" and save_pretrained dumps it into tokenizer.json, which is the file we rely on.
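The preference described above can be sketched like this (illustrative only; the function and its logic are hypothetical, not the actual transformers or Bumblebee code):

```python
# Illustrative sketch of the tokenizer lookup order described above.
# The function name and logic are hypothetical, not real library code.
def pick_tokenizer_source(repo_files):
    """Prefer the self-contained fast tokenizer.json; otherwise fall
    back to the slow-tokenizer files, which need conversion first."""
    if "tokenizer.json" in repo_files:
        return "fast"  # loadable directly, e.g. by Bumblebee
    if "tokenizer.model" in repo_files:
        return "slow"  # convert via transformers, then save_pretrained
    raise FileNotFoundError("no tokenizer files found in repo")

print(pick_tokenizer_source({"tokenizer.model", "config.json"}))  # prints "slow"
```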
If you want to open a PR on the HF repos, here's an example; just make sure you have the latest transformers installed locally before doing the conversion. No pressure though, I can also do it later :)
@bianchidotdev I opened a PR while testing a new conversion tool and I noticed you opened one already, thanks!
FTR you don't have to wait for the PR to be merged, you can just reference the PR commit directly:
{:ok, tokenizer} =
  Bumblebee.load_tokenizer(
    {:hf, "openlm-research/open_llama_3b_v2",
     revision: "52944fc4e35e6ca00e733b95df79498728016e1d"}
  )
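Pinning a revision works because the Hub serves repo files at per-revision URLs. A small sketch of that URL construction, assuming the usual `/<repo>/resolve/<revision>/<file>` Hub layout:

```python
# Build the raw-file URL for a pinned revision, assuming the usual
# Hugging Face Hub layout of /<repo>/resolve/<revision>/<file>.
def hub_file_url(repo, file, revision="main"):
    return f"https://huggingface.co/{repo}/resolve/{revision}/{file}"

url = hub_file_url(
    "openlm-research/open_llama_3b_v2",
    "tokenizer.json",
    revision="52944fc4e35e6ca00e733b95df79498728016e1d",
)
print(url)
```

Because the revision is a commit hash rather than a branch name, the file you fetch never changes, even after the PR is merged.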
Also, I improved the error messages in #256, so it will be clear why the tokenizer cannot be loaded. And we have a new section in the README with actions the user may take :)
I'm having issues loading certain models on Huggingface that might largely be an issue with those repos rather than bumblebee.
What I'm seeing:
It looks like it's failing while searching for a tokenizer.json. Unfortunately, the huggingface repo ships only with a tokenizer.model and related config files but not a tokenizer.json, and it appears quite a few models on huggingface follow suit. I'm not sure what the effort would be to support loading the model directly or if there are other ways around this.