Eland needs access to a model's vocabulary file so that it can be uploaded to Elasticsearch along with the model definition. In some cases the vocab file is not included in the model repo on HuggingFace; one example is Jina Reranker. The eland_import_hub_model script fails with this error when the file is missing:
Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
This happens because AutoTokenizer.from_pretrained(...) is called with use_fast=False: the slow XLM-RoBERTa tokenizer reads its vocabulary from a SentencePiece file (sentencepiece.bpe.model), so when that file is absent from the repo, vocab_file resolves to None and loading fails.
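The difference is easy to reproduce outside eland. A minimal sketch (the use_fast=True branch assumes the repo ships a tokenizer.json, which is why the model is otherwise usable):

```python
from transformers import AutoTokenizer

repo = "jinaai/jina-reranker-v2-base-multilingual"

try:
    # Slow tokenizer: needs the SentencePiece file (sentencepiece.bpe.model),
    # which this repo does not ship, so vocab_file resolves to None.
    AutoTokenizer.from_pretrained(repo, use_fast=False)
except OSError as err:
    print(err)  # "Unable to load vocabulary from file. ..."

# Fast tokenizer: loads from tokenizer.json instead (assumed present in the repo).
tok = AutoTokenizer.from_pretrained(repo, use_fast=True)
```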
It should be possible to download the vocab from the base model. Investigate other ways to obtain the vocab file when it is not present in the model repo.
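A sketch of the base-model fallback, assuming the reranker reuses XLM-RoBERTa's vocabulary unchanged (the traceback below resolves to XLMRobertaTokenizer; the repo id and file name come from the upstream xlm-roberta-base repo, and whether the vocabularies actually match would need to be verified):

```python
from huggingface_hub import hf_hub_download
from transformers import XLMRobertaTokenizer

# Assumption: jina-reranker-v2 keeps XLM-RoBERTa's SentencePiece vocab unchanged.
vocab_path = hf_hub_download(
    repo_id="FacebookAI/xlm-roberta-base",
    filename="sentencepiece.bpe.model",
)

# Build the slow tokenizer directly from the downloaded SentencePiece model.
tokenizer = XLMRobertaTokenizer(vocab_file=vocab_path)
```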
I'm running this command:

eland_import_hub_model --cloud-id labs:xxxxxx== --hub-model-id jinaai/jina-reranker-v2-base-multilingual --task-type text_similarity --es-api-key xxxx== --start --clear-previous

and I'm getting this error:
2024-09-03 01:59:53,443 INFO : Establishing connection to Elasticsearch
2024-09-03 01:59:53,940 INFO : Connected to cluster named 'XXX' (version: 8.15.0)
2024-09-03 01:59:53,942 INFO : Loading HuggingFace transformer tokenizer and model 'jinaai/jina-reranker-v2-base-multilingual'
/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 154, in __init__
    self.sp_model.Load(str(vocab_file))
  File "/usr/local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "None": No such file or directory Error #2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/eland_import_hub_model", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/eland/cli/eland_import_hub_model.py", line 298, in main
    tm = TransformerModel(
  File "/usr/local/lib/python3.10/site-packages/eland/ml/pytorch/transformers.py", line 655, in __init__
    self._tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 768, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2024, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2258, in _from_pretrained
    raise OSError(
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
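To confirm the vocab file is genuinely absent from the repo rather than merely failing to download, the repo contents can be listed. A quick check, assuming a recent huggingface_hub:

```python
from huggingface_hub import list_repo_files

files = list_repo_files("jinaai/jina-reranker-v2-base-multilingual")
print("sentencepiece.bpe.model" in files)   # expected False if the vocab is missing
print([f for f in files if "token" in f])   # e.g. tokenizer.json, tokenizer_config.json
```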