elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

Why is the openai/clip-vit-base-patch32 model not supported? #662

Closed RobinYang11 closed 4 months ago

RobinYang11 commented 5 months ago
eland_import_hub_model --url https://elastic:q=MYL8TsnSVUhmlwOIWa@localhost:9200 \
  --hub-model-id openai/clip-vit-base-patch32 \
  --task-type text_embedding \
  --ca-certs /Users/robinyang/http_ca.crt \
  --start

2024-02-10 16:04:17,677 INFO : Establishing connection to Elasticsearch
2024-02-10 16:04:17,728 INFO : Connected to cluster named 'docker-cluster' (version: 8.12.0)
2024-02-10 16:04:17,729 INFO : Loading HuggingFace transformer tokenizer and model 'openai/clip-vit-base-patch32'
Traceback (most recent call last):
  File "/Users/robinyang/Library/Python/3.9/bin/eland_import_hub_model", line 8, in <module>
    sys.exit(main())
  File "/Users/robinyang/Library/Python/3.9/lib/python/site-packages/eland/cli/eland_import_hub_model.py", line 254, in main
    tm = TransformerModel(
  File "/Users/robinyang/Library/Python/3.9/lib/python/site-packages/eland/ml/pytorch/transformers.py", line 649, in __init__
    raise TypeError(
TypeError: Tokenizer type CLIPTokenizer(name_or_path='openai/clip-vit-base-patch32', vocab_size=49408, model_max_length=77, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True) not supported, must be one of: <class 'transformers.models.bart.tokenization_bart.BartTokenizer'>, <class 'transformers.models.bert.tokenization_bert.BertTokenizer'>, <class 'transformers.models.bert_japanese.tokenization_bert_japanese.BertJapaneseTokenizer'>, <class 'transformers.models.deprecated.retribert.tokenization_retribert.RetriBertTokenizer'>, <class 'transformers.models.distilbert.tokenization_distilbert.DistilBertTokenizer'>, <class 'transformers.models.dpr.tokenization_dpr.DPRContextEncoderTokenizer'>, <class 'transformers.models.dpr.tokenization_dpr.DPRQuestionEncoderTokenizer'>, <class 'transformers.models.electra.tokenization_electra.ElectraTokenizer'>, <class 'transformers.models.mobilebert.tokenization_mobilebert.MobileBertTokenizer'>, <class 'transformers.models.mpnet.tokenization_mpnet.MPNetTokenizer'>, <class 'transformers.models.roberta.tokenization_roberta.RobertaTokenizer'>, <class 'transformers.models.squeezebert.tokenization_squeezebert.SqueezeBertTokenizer'>, <class 'transformers.models.xlm_roberta.tokenization_xlm_roberta.XLMRobertaTokenizer'>
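
The traceback shows the import failing on a tokenizer type check before any model is uploaded. The sketch below illustrates that kind of allow-list check in plain Python. It is not eland's actual source: the three tokenizer classes are empty stand-ins for the `transformers` classes named in the error, and `check_tokenizer` is a hypothetical helper name.

```python
# Stand-in classes; the real ones live in the transformers package.
class BertTokenizer:
    pass


class RobertaTokenizer:
    pass


class CLIPTokenizer:
    pass


# Hypothetical allow-list, shortened from the list printed in the error.
SUPPORTED_TOKENIZERS = (BertTokenizer, RobertaTokenizer)


def check_tokenizer(tokenizer):
    """Return the tokenizer if its type is on the allow-list, else raise TypeError."""
    if not isinstance(tokenizer, SUPPORTED_TOKENIZERS):
        allowed = ", ".join(str(t) for t in SUPPORTED_TOKENIZERS)
        raise TypeError(
            f"Tokenizer type {type(tokenizer).__name__} not supported, "
            f"must be one of: {allowed}"
        )
    return tokenizer


# A supported tokenizer passes through unchanged.
check_tokenizer(BertTokenizer())

# An unsupported tokenizer is rejected, as in the report above.
try:
    check_tokenizer(CLIPTokenizer())
except TypeError as err:
    print(err)
```

Under this reading, the error says nothing about the model weights themselves; the tokenizer class alone is enough to stop the import.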

pquentin commented 4 months ago

Hello! This was already reported in https://github.com/elastic/eland/issues/546. Closing this issue as a duplicate. Thank you.

davidkyle commented 4 months ago

Cross-posting from https://github.com/elastic/eland/issues/546#issuecomment-1940920699 for visibility.

CLIP consists of two models: an image processing model and a text embedding model. Elastic does not support image processing models, but if you want to use the text embedding model, you can import the Sentence Transformers implementation: https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1

eland_import_hub_model --url https://elastic:q=MYL8TsnSVUhmlwOIWa@localhost:9200 \
  --hub-model-id sentence-transformers/clip-ViT-B-32-multilingual-v1 \
  --task-type text_embedding \
  --ca-certs /Users/robinyang/http_ca.crt \
  --start
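
If the import needs to be scripted, the same command can be assembled as an argument list in Python. This is a minimal sketch that uses only the flags shown in this thread; `build_import_command` is a hypothetical helper, and the URL and certificate path in the usage lines are placeholders to replace with your own values.

```python
import shlex


def build_import_command(es_url, hub_model_id, ca_certs,
                         task_type="text_embedding", start=True):
    """Build the eland_import_hub_model argument list used in this thread."""
    cmd = [
        "eland_import_hub_model",
        "--url", es_url,
        "--hub-model-id", hub_model_id,
        "--task-type", task_type,
        "--ca-certs", ca_certs,
    ]
    if start:
        # --start deploys the model right after the upload finishes.
        cmd.append("--start")
    return cmd


# Placeholder values; substitute your own cluster URL and CA certificate.
cmd = build_import_command(
    es_url="https://elastic:<password>@localhost:9200",
    hub_model_id="sentence-transformers/clip-ViT-B-32-multilingual-v1",
    ca_certs="/path/to/http_ca.crt",
)

# Print a shell-quoted form for inspection before running it.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it without going through a shell, which avoids quoting problems with special characters in the password.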