elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

Model multilingual-e5-small fails to start #583

Closed joshdevins closed 10 months ago

joshdevins commented 1 year ago

At present, the model is processed and uploaded fine, but when starting the model, it fails:

forward() is missing value for argument 'token_type_ids'.

The model is trained from Multilingual-MiniLM, which is a BERT model, but it uses the XLM-RoBERTa tokenizer. Since we wrap models based on their architecture and not on the tokenizer type, the BERT model expects an input (token_type_ids) that the XLM-RoBERTa tokenizer does not produce. We should consider choosing the wrapper (three inputs or two) based on the tokenizer instead.

Note that the base and large model variants work fine because they are XLM-RoBERTa models, and use the corresponding tokenizer.
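
To illustrate the proposal above, here is a minimal sketch of choosing a wrapper from the tokenizer's declared inputs rather than from the model architecture. It is not eland's actual implementation: `FakeTokenizer`, `choose_wrapper`, and the wrapper names are hypothetical. The only real detail assumed is that Hugging Face tokenizers expose a `model_input_names` list, which includes `token_type_ids` for BERT-style tokenizers and omits it for XLM-RoBERTa-style ones.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer.

    Only carries the one attribute the decision below relies on.
    """
    model_input_names: List[str] = field(default_factory=list)


def choose_wrapper(tokenizer) -> str:
    """Pick a (hypothetical) wrapper name from the tokenizer's inputs.

    If the tokenizer emits token_type_ids, the traced model can be fed
    three inputs; otherwise only input_ids and attention_mask exist.
    """
    if "token_type_ids" in tokenizer.model_input_names:
        return "three_inputs"
    return "two_inputs"


# A BERT-style tokenizer declares three inputs.
bert_like = FakeTokenizer(["input_ids", "token_type_ids", "attention_mask"])
# An XLM-RoBERTa-style tokenizer declares two, even if the model is BERT.
xlmr_like = FakeTokenizer(["input_ids", "attention_mask"])

print(choose_wrapper(bert_like))
print(choose_wrapper(xlmr_like))
```

With this approach, a BERT-architecture model paired with an XLM-RoBERTa tokenizer (the multilingual-e5-small case) would get the two-input wrapper, so the wrapper itself would have to supply any `token_type_ids` the model's `forward()` requires.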

serenachou commented 12 months ago

@srikanthmanvi @pquentin @technige hey y'all, wanted to flag that this issue in eland is preventing users from using the e5 small model (easiest to use) in Elasticsearch. It would be amazing if the next Eland release fixed this issue so we can provide support for the e5 small model.

srikanthmanvi commented 12 months ago

@serenachou thanks for the ping. We will prioritize this.

joshdevins commented 12 months ago

Please note that this applies only to the multilingual E5 model. The normal e5-small-v2 works just fine.

serenachou commented 11 months ago

@srikanthmanvi if this isn't on your radar for 8.12, we would love for it to be included in 8.12, or any earlier version that you and @pquentin are cooking up. After 8.12 we would likely be looking to prepared models, so this work would be less effective as a way to encourage customers to use this model for multilingual use cases.

joshdevins commented 11 months ago

Discussed f2f, @davidkyle will have a look and I will support.

ialdencoots commented 9 months ago

Any reason to support intfloat/multilingual-e5-small but not intfloat/multilingual-e5-base or intfloat/multilingual-e5-large? It would be nice if those models worked as well.

ialdencoots commented 9 months ago

Taking a look at the code updates, it seems this fix should affect the base and large models as well. My issue may be unrelated to this then, but I'm experiencing an issue where the first infer request after deployment works for the larger models, but the second request and onward returns a vector of all zeros. multilingual-e5-small works as expected though. Is this likely to be an eland bug or something better addressed in the elasticsearch repo?

davidkyle commented 9 months ago

@ialdencoots thanks for reporting the problem. I have reproduced it myself. You can track the issue at https://github.com/elastic/elasticsearch/issues/102541

The bug fix linked above only applies to the small model; the error you are seeing is a different issue.

intfloat/multilingual-e5-small works well in Elastic, but note that the E5 models are trained with prefix strings, which should be used for information retrieval. See https://huggingface.co/intfloat/multilingual-e5-base#faq. Prefix string support has been added to Elasticsearch in https://github.com/elastic/elasticsearch/pull/102089 and will be available in the next release (8.12).
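
For anyone on a version without built-in prefix support, a minimal client-side sketch of adding the prefixes before sending text for inference might look like the following. The `"query: "` and `"passage: "` strings are the ones described in the E5 model card FAQ linked above; the helper name `add_e5_prefix` is hypothetical and not part of eland or Elasticsearch.

```python
from typing import List

# Prefix strings as described in the E5 model card FAQ.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def add_e5_prefix(texts: List[str], is_query: bool) -> List[str]:
    """Prepend the E5 prefix to each text (hypothetical helper).

    Use is_query=True for search queries and is_query=False for the
    documents (passages) being indexed.
    """
    prefix = QUERY_PREFIX if is_query else PASSAGE_PREFIX
    return [prefix + text for text in texts]


queries = add_e5_prefix(["how do I start a trained model?"], is_query=True)
passages = add_e5_prefix(["Start the deployment before inference."], is_query=False)

print(queries)
print(passages)
```

The same prefix has to be applied consistently: passages at index time and queries at search time.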