elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0
628 stars 98 forks

[NLP] Tests for NLP model configurations #623

Closed davidkyle closed 9 months ago

davidkyle commented 9 months ago

The configurations generated for the models uploaded to Elasticsearch are derived from the transformer model's config; this test asserts on various settings in the generated configurations.

A new function is added for finding the model's max_sequence_length, as the value can be found in different places. Specifically, the previous method of looking up the model in the tokenizer's max_model_input_sizes map sometimes returns None:

```python
getattr(self._tokenizer, "max_model_input_sizes", dict()).get(self._model_id)
```

If the above returns None, the tokenizer's model_max_length property is checked; if that is also None, the model configuration is checked.
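The fallback chain described above could be sketched roughly as follows. The function name, the model_config attribute (max_position_embeddings), and the exception type are illustrative assumptions, not the actual eland implementation:

```python
def find_max_sequence_length(tokenizer, model_config, model_id: str) -> int:
    """Sketch of the lookup fallback chain (illustrative, not eland's code)."""
    # 1. Try the tokenizer's max_model_input_sizes map; may be absent or
    #    may not contain this model_id.
    max_len = getattr(tokenizer, "max_model_input_sizes", dict()).get(model_id)
    if max_len is not None:
        return max_len
    # 2. Fall back to the tokenizer's model_max_length property.
    max_len = getattr(tokenizer, "model_max_length", None)
    if max_len is not None:
        return max_len
    # 3. Fall back to the model configuration (attribute name assumed here).
    max_len = getattr(model_config, "max_position_embeddings", None)
    if max_len is not None:
        return max_len
    # 4. Nothing found: raise rather than silently let a default be applied.
    raise ValueError(f"Cannot determine max_sequence_length for {model_id}")
```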

If max_sequence_length cannot be found, an exception is thrown. This changes the behaviour: previously the value was left unset and Elasticsearch would pick a default of 512. That was good in that the model still got uploaded, but bad in that the error was silently ignored and could result in an incorrect setting.

One suggestion is to allow the user to manually specify a max_sequence_length for the case where it cannot be found.
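That suggestion might look like the following, where an explicit user-supplied value short-circuits the auto-detection. The signature and parameter name are hypothetical, not the eland API:

```python
def find_max_sequence_length(tokenizer, model_config, model_id: str,
                             max_sequence_length=None) -> int:
    """Hypothetical extension: a user-specified max_sequence_length
    takes precedence over auto-detection (illustrative only)."""
    if max_sequence_length is not None:
        # User override: trust the explicitly provided value.
        return max_sequence_length
    # Otherwise fall through the same detection chain as before.
    value = getattr(tokenizer, "max_model_input_sizes", dict()).get(model_id)
    if value is None:
        value = getattr(tokenizer, "model_max_length", None)
    if value is None:
        value = getattr(model_config, "max_position_embeddings", None)
    if value is None:
        raise ValueError(f"Cannot determine max_sequence_length for {model_id}")
    return value
```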