elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

Add model JinaBertForMaskedLM to the supported list. #628

Open Cris-Maggi opened 8 months ago

Cris-Maggi commented 8 months ago

Please add the new BERT-based model JinaBertForMaskedLM to the supported list, as customers are requesting it (model page: https://huggingface.co/jinaai/jina-embeddings-v2-base-en). CLI command used for the import:

**docker run -it --rm elastic/eland eland_import_hub_model --url https://elastic:password@elasticsearchlink/ --hub-model-id jinaai/jina-embeddings-v2-base-en --task-type text_embedding --start**

Error observed:

Traceback (most recent call last):
  File "/usr/local/bin/eland_import_hub_model", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/eland/cli/eland_import_hub_model.py", line 241, in main
    tm = TransformerModel(
  File "/usr/local/lib/python3.10/site-packages/eland/ml/pytorch/transformers.py", line 642, in __init__
    self._config = self._create_config(es_version)
  File "/usr/local/lib/python3.10/site-packages/eland/ml/pytorch/transformers.py", line 735, in _create_config
    tokenization_config = self._create_tokenization_config()
  File "/usr/local/lib/python3.10/site-packages/eland/ml/pytorch/transformers.py", line 673, in _create_tokenization_config
    _max_sequence_length = self._find_max_sequence_length()
  File "/usr/local/lib/python3.10/site-packages/eland/ml/pytorch/transformers.py", line 730, in _find_max_sequence_length
    raise ValueError("Cannot determine model max input length")
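
The traceback ends in `_find_max_sequence_length`, which suggests eland could not find a usable maximum input length for this model. As a diagnostic aid, here is a minimal sketch (not eland's actual implementation) of how one might look for a plausible value. The function name `guess_max_sequence_length` and the `REASONABLE_MAX_LENGTH` ceiling are hypothetical. `model_max_length` and `max_position_embeddings` are attributes that Hugging Face tokenizers and BERT-style configs commonly expose, but whether this particular model sets them to sensible values is an assumption to verify. Stand-in objects are used so the sketch runs without downloading anything.

```python
from types import SimpleNamespace
from typing import Any, Optional

# Hypothetical ceiling: tokenizers may report a huge placeholder value
# when no real limit has been configured, so such values are ignored.
REASONABLE_MAX_LENGTH = 8192


def guess_max_sequence_length(
    tokenizer: Any, config: Any, ceiling: int = REASONABLE_MAX_LENGTH
) -> Optional[int]:
    """Return the first plausible max input length, or None if none is found."""
    candidates = [
        # Attribute commonly found on Hugging Face tokenizers.
        getattr(tokenizer, "model_max_length", None),
        # Attribute commonly found on BERT-style model configs.
        getattr(config, "max_position_embeddings", None),
    ]
    for value in candidates:
        if isinstance(value, int) and 0 < value <= ceiling:
            return value
    return None


if __name__ == "__main__":
    # Stand-ins for a tokenizer with a placeholder limit and a config
    # that carries the real limit (values chosen for illustration only).
    tokenizer = SimpleNamespace(model_max_length=int(1e30))
    config = SimpleNamespace(max_position_embeddings=8192)
    print(guess_max_sequence_length(tokenizer, config))
```

With a real model, the same function could be called on objects loaded through `transformers` (for example `AutoTokenizer.from_pretrained` and `AutoConfig.from_pretrained`). A `None` result would be consistent with the error above, although it would not prove that eland performs the same checks.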

steveedcast commented 8 months ago

I also have a need to use Jina for long input sequences. Any idea when Elastic would support this in eland?