elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0
628 stars 98 forks

[NLP] Tests for NLP model configurations #623

Closed davidkyle closed 9 months ago

davidkyle commented 9 months ago

The configurations generated for the models uploaded to Elasticsearch are derived from the transformer model's config; this test asserts on various settings in the generated configurations.

A new function is added for finding the model's max_sequence_length, as the value can be found in different places. Specifically, the previous method of looking up the model in the tokenizer's max_model_input_sizes map sometimes returns None:

```python
getattr(self._tokenizer, "max_model_input_sizes", dict()).get(self._model_id)
```

If the above returns None, the tokenizer's model_max_length property is checked; if that is also None, the model configuration is checked.
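The fallback chain described above could be sketched roughly as follows. The function name, the model_config attribute (max_position_embeddings), and the exception type are illustrative assumptions, not the actual eland implementation:

```python
def find_max_sequence_length(tokenizer, model_config, model_id: str) -> int:
    """Sketch of the lookup fallback chain (illustrative, not eland's code)."""
    # 1. Try the tokenizer's max_model_input_sizes map; may be absent or
    #    may not contain this model_id.
    max_len = getattr(tokenizer, "max_model_input_sizes", dict()).get(model_id)
    if max_len is not None:
        return max_len
    # 2. Fall back to the tokenizer's model_max_length property.
    max_len = getattr(tokenizer, "model_max_length", None)
    if max_len is not None:
        return max_len
    # 3. Fall back to the model configuration (attribute name assumed here).
    max_len = getattr(model_config, "max_position_embeddings", None)
    if max_len is not None:
        return max_len
    # 4. Nothing found: raise rather than silently let a default be applied.
    raise ValueError(f"Cannot determine max_sequence_length for {model_id}")
```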

If max_sequence_length cannot be found, an exception is thrown. This changes the behaviour: previously the value was left unset and Elasticsearch would pick a default of 512. That was good in that the model still got uploaded, but bad in that the error was silently ignored and could result in an incorrect setting.

One suggestion is to allow the user to manually specify a max_sequence_length for the case where it cannot be found.
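That suggestion might look like the following, where an explicit user-supplied value short-circuits the auto-detection. The signature and parameter name are hypothetical, not the eland API:

```python
def find_max_sequence_length(tokenizer, model_config, model_id: str,
                             max_sequence_length=None) -> int:
    """Hypothetical extension: a user-specified max_sequence_length
    takes precedence over auto-detection (illustrative only)."""
    if max_sequence_length is not None:
        # User override: trust the explicitly provided value.
        return max_sequence_length
    # Otherwise fall through the same detection chain as before.
    value = getattr(tokenizer, "max_model_input_sizes", dict()).get(model_id)
    if value is None:
        value = getattr(tokenizer, "model_max_length", None)
    if value is None:
        value = getattr(model_config, "max_position_embeddings", None)
    if value is None:
        raise ValueError(f"Cannot determine max_sequence_length for {model_id}")
    return value
```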