elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

Default truncation to `second` for text similarity #713

Closed davidkyle closed 1 month ago

davidkyle commented 1 month ago

NLP models have 3 truncation settings: FIRST, SECOND and NONE

FIRST means truncate the first input. In most cases there is only 1 input (e.g. for text embeddings), so this is a sensible default. SECOND means truncate the second input. Task types with 2 inputs include extractive question answering, where the question is one input and the context is the other. Text similarity also takes 2 inputs. NONE means don't truncate and window the input.

For text similarity the first input is usually the shorter one; for example, it might be the query text in a rerank operation. In this situation it is better to truncate the second input. This change makes SECOND the default for text similarity.
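
To illustrate the difference, here is a minimal sketch of how FIRST and SECOND behave on a two-input request. It is not eland's or Elasticsearch's implementation: `truncate_pair` is a hypothetical helper, tokens are plain whitespace-split words, special tokens are ignored, and NONE is left out of the sketch.

```python
from typing import List, Tuple


def truncate_pair(
    first: List[str], second: List[str], max_len: int, truncate: str
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: trim one of two token lists to fit max_len.

    `truncate` is "first" or "second" and names the input that gets trimmed.
    """
    overflow = len(first) + len(second) - max_len
    if overflow <= 0:
        # Both inputs already fit; nothing to trim.
        return first, second

    if truncate == "first":
        target, other = first, second
    elif truncate == "second":
        target, other = second, first
    else:
        raise ValueError(f"unsupported truncate setting in this sketch: {truncate}")

    if overflow > len(target):
        # Trimming the chosen input alone cannot make the pair fit.
        raise ValueError("input to truncate is shorter than the overflow")

    trimmed = target[: len(target) - overflow]
    return (trimmed, other) if truncate == "first" else (other, trimmed)


# Assumed example data: a short rerank query and a longer document.
query = "how do I install".split()          # 4 tokens
document = ["doc"] * 10                     # 10 tokens
max_len = 10

q_first, d_first = truncate_pair(query, document, max_len, "first")
q_second, d_second = truncate_pair(query, document, max_len, "second")
```

With these assumed inputs, truncating the first input removes the entire query, while truncating the second input keeps the query intact and only shortens the document.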