elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

[NLP] Support E5 small multi-lingual #625

Closed davidkyle closed 8 months ago

davidkyle commented 9 months ago

E5 small multi-lingual is based on a BERT architecture, which made the code that traces the model assume that the model's forward() function takes 4 parameters. However, the model uses the XLMRoBERTa tokenizer, which produces only 2 inputs. This led to an error when evaluating the model, which complained of missing arguments.

The fix is to use the tokenizer type, rather than the model architecture, to determine the number of inputs to the model's forward() function.
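
A rough sketch of the idea, not the actual eland code: the tokenizer classes below are stand-ins so the snippet runs without transformers installed, and `trace_inputs` and `TWO_INPUT_TOKENIZERS` are hypothetical names. Which 4 parameters the BERT-style path passes (here input ids, attention mask, token type ids, position ids) is also an assumption made for illustration.

```python
from typing import Dict, List, Tuple


class BertLikeTokenizer:
    """Stand-in for a BERT-style tokenizer (hypothetical, for illustration)."""

    def __call__(self, text: str) -> Dict[str, List[int]]:
        ids = [101, 7592, 102]  # dummy token ids
        return {
            "input_ids": ids,
            "attention_mask": [1] * len(ids),
            "token_type_ids": [0] * len(ids),
        }


class XlmRobertaLikeTokenizer:
    """Stand-in for an XLM-RoBERTa-style tokenizer: no token_type_ids."""

    def __call__(self, text: str) -> Dict[str, List[int]]:
        ids = [0, 35378, 2]  # dummy token ids
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}


# Tokenizer types whose models are traced with 2 inputs (assumed list).
TWO_INPUT_TOKENIZERS = (XlmRobertaLikeTokenizer,)


def trace_inputs(tokenizer, text: str) -> Tuple[List[int], ...]:
    """Build the tuple passed to forward(), keyed on the tokenizer type."""
    encoded = tokenizer(text)
    input_ids = encoded["input_ids"]
    attention_mask = encoded["attention_mask"]

    # Decide by tokenizer type, not by model architecture.
    if isinstance(tokenizer, TWO_INPUT_TOKENIZERS):
        return (input_ids, attention_mask)

    # BERT-style path: fill in anything the tokenizer did not provide.
    token_type_ids = encoded.get("token_type_ids", [0] * len(input_ids))
    position_ids = list(range(len(input_ids)))
    return (input_ids, attention_mask, token_type_ids, position_ids)


two = trace_inputs(XlmRobertaLikeTokenizer(), "query: hello")
four = trace_inputs(BertLikeTokenizer(), "hello")
print(len(two), len(four))
```

With this shape, a BERT-architecture model paired with an XLM-RoBERTa-style tokenizer is traced with 2 inputs, so forward() is never called with arguments the tokenizer cannot supply.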

Closes #583

joshdevins commented 9 months ago

Does this close #583 or is another PR pending?

davidkyle commented 9 months ago

> Does this close https://github.com/elastic/eland/issues/583 or is another PR pending?

One more PR is queued up to add some testing, but we can consider #583 closed after this.