elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

Support for TaylorAI/gte-tiny #680

Closed: Shifter2600 closed this issue 2 months ago

Shifter2600 commented 6 months ago

I receive the following error when loading the model:

Downloading: 100%|██████████████████████████████████████████████████████████████████████| 1.50k/1.50k [00:00<00:00, 980kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████| 226k/226k [00:00<00:00, 1.97MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████| 82.0/82.0 [00:00<00:00, 92.6kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████| 228/228 [00:00<00:00, 111kB/s]
Traceback (most recent call last):
  File "/usr/local/bin/eland_import_hub_model", line 197, in <module>
    tm = TransformerModel(args.hub_model_id, args.task_type, args.quantize)
  File "/usr/local/lib/python3.9/dist-packages/eland/ml/pytorch/transformers.py", line 567, in __init__
    self._tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/auto/tokenization_auto.py", line 579, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/tokenization_utils_base.py", line 1783, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.9/dist-packages/transformers/tokenization_utils_base.py", line 1984, in _from_pretrained
    raise ValueError(
ValueError: Non-consecutive added token '[PAD]' found. Should have index 30522 but has index 0 in saved vocabulary.

davidkyle commented 2 months ago

I was able to install this model using the 8.14 Docker image:

docker pull docker.elastic.co/eland/eland:8.14.0

And installed with:

docker run -it --rm docker.elastic.co/eland/eland:8.14.0 \
    eland_import_hub_model \
      --cloud-id $CLOUD_ID \
      -u elastic -p $CLOUD_PWD \
      --hub-model-id 'TaylorAI/gte-tiny' \
      --task-type text_embedding

Closing the issue, as the error comes from the Transformers library and appears to be fixed now.
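
For readers wondering what the ValueError in the traceback means, here is a minimal sketch of the kind of check that produces it. It is a simplified illustration, not the Transformers library's actual code: the function name `check_added_tokens` and its arguments are hypothetical, and the input data is made up to mirror the numbers in the error message (a base vocabulary of 30522 entries and an added `[PAD]` token saved at index 0).

```python
def check_added_tokens(base_vocab_size, added_tokens):
    """Simplified, hypothetical consecutive-index check on added tokens.

    added_tokens maps a token string to the index saved for it. Each added
    token is expected to take the next free index after the base vocabulary.
    """
    next_index = base_vocab_size
    # Walk the added tokens in order of their saved index.
    for token, index in sorted(added_tokens.items(), key=lambda item: item[1]):
        if index != next_index:
            raise ValueError(
                f"Non-consecutive added token '{token}' found. "
                f"Should have index {next_index} but has index {index} "
                "in saved vocabulary."
            )
        next_index += 1


# Made-up input mirroring the error in this issue: [PAD] is recorded as an
# added token at index 0 instead of at the end of the 30522-entry vocabulary.
try:
    check_added_tokens(30522, {"[PAD]": 0})
except ValueError as err:
    message = str(err)
    print(message)
```

Under this reading, the saved tokenizer files list `[PAD]` as an added token with an index that already belongs to the base vocabulary, which a strict check of this kind rejects.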