joshdevins opened 9 months ago
The following is a list of models that we wish to verify compatibility with, per task type. The list is based on the base models and tokenizers that we support, and the tasks we support.
- `fill_mask`
- `ner`
- `text_expansion`
- `text_classification`
- `text_embedding`
- `zero_shot_classification`
- `question_answering`
- `text_similarity`
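One way the model/version test matrix described below could be generated is sketched here. The model names, task pairings, and Elasticsearch version strings are placeholders for illustration only; the actual list of models to test is still to be decided.

```python
# Sketch of test-matrix generation. Model IDs, tasks, and versions here
# are hypothetical placeholders, not the final list for this issue.
from itertools import product

MODELS = [
    ("sentence-transformers/all-MiniLM-L6-v2", "text_embedding"),
    ("distilbert-base-uncased", "fill_mask"),
]
ES_VERSIONS = ["8.8.0", "8.9.0"]  # placeholder version strings

def build_matrix(models, es_versions):
    """Cross every (model, task) pair with every Elasticsearch version."""
    return [
        (model_id, task, es_version)
        for (model_id, task), es_version in product(models, es_versions)
    ]

matrix = build_matrix(MODELS, ES_VERSIONS)
# Each entry would become one CI test case, e.g.
# ("distilbert-base-uncased", "fill_mask", "8.9.0")
```

Each tuple could then be fed to a parametrized test (e.g. via `pytest.mark.parametrize`) so that every model is exercised against every supported Elasticsearch version in CI.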
Today we rely mostly on unit testing for PyTorch/NLP model import testing. We perform large-scale testing as part of other components like Elasticsearch, but we often only find bugs later and can't tie them to specific changes in eland (e.g. to a specific PR). We'd like to improve integration testing in eland by running a test matrix of models against multiple Elasticsearch versions. For each model, we'd test multiple inputs of various lengths, up to and beyond each model's input limit, and validate the inference results from Elasticsearch directly against results from `transformers` as ground truth. Tests should run as part of the normal CI cycle and need to pass before a PR can be merged.

More details to follow in this issue, such as the list of models to test.