elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
Apache License 2.0

Add model JinaBertForMaskedLM to the supported list. #628

Open Cris-Maggi opened 8 months ago

Cris-Maggi commented 8 months ago

Please add the new BERT-based model JinaBertForMaskedLM to the supported list, as customers are requesting it. Model page: https://huggingface.co/jinaai/jina-embeddings-v2-base-en

CLI command used for the import:

**docker run -it --rm elastic/eland eland_import_hub_model --url https://elastic:password@elasticsearchlink/ --hub-model-id jinaai/jina-embeddings-v2-base-en --task-type text_embedding --start**

Error observed:

Traceback (most recent call last):
  File "/usr/local/bin/eland_import_hub_model", line 8, in <module>
  File "/usr/local/lib/python3.10/site-packages/eland/cli/eland_import_hub_model.py", line 241, in main
    tm = TransformerModel(
  File "/usr/local/lib/python3.10/site-packages/eland/ml/pytorch/transformers.py", line 642, in __init__
    self._config = self._create_config(es_version)
  File "/usr/local/lib/python3.10/site-packages/eland/ml/pytorch/transformers.py", line 735, in _create_config
    tokenization_config = self._create_tokenization_config()
  File "/usr/local/lib/python3.10/site-packages/eland/ml/pytorch/transformers.py", line 673, in _create_tokenization_config
    _max_sequence_length = self._find_max_sequence_length()
  File "/usr/local/lib/python3.10/site-packages/eland/ml/pytorch/transformers.py", line 730, in _find_max_sequence_length
    raise ValueError("Cannot determine model max input length")
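
For anyone debugging this: the traceback ends in a length lookup that gives up when it cannot find a usable limit. The following is only a minimal sketch of that kind of check, not eland's actual code. The function name `find_max_sequence_length`, the `REASONABLE_MAX_LENGTH` cap, and the stand-in objects are assumptions made for illustration. `model_max_length` and `max_position_embeddings` are attribute names that Hugging Face tokenizers and BERT-style configs commonly expose.

```python
from types import SimpleNamespace

# Hypothetical cap (assumption): tokenizers sometimes report a huge
# placeholder value when no real limit is configured.
REASONABLE_MAX_LENGTH = 8192


def find_max_sequence_length(tokenizer, config=None):
    """Return a plausible max input length, or raise ValueError."""
    candidates = [
        getattr(tokenizer, "model_max_length", None),
        getattr(config, "max_position_embeddings", None),
    ]
    for value in candidates:
        # Accept only positive integers within the assumed cap.
        if isinstance(value, int) and 0 < value <= REASONABLE_MAX_LENGTH:
            return value
    raise ValueError("Cannot determine model max input length")


# Stand-in objects for illustration, not real Hugging Face classes.
ok_tokenizer = SimpleNamespace(model_max_length=512)
odd_tokenizer = SimpleNamespace(model_max_length=int(1e30))

print(find_max_sequence_length(ok_tokenizer))
try:
    find_max_sequence_length(odd_tokenizer)
except ValueError as err:
    print(err)
```

If a check along these lines is what fails here, inspecting the tokenizer and config values reported for the model would show which one falls outside the accepted range.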

steveedcast commented 8 months ago

I also have a need to use Jina for long input sequences. Any idea when Elastic would support this in eland?