UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Cross-Encoder max_length=512: does it apply to question and answer together, or to each individually? #1025

Open KailashDN opened 3 years ago

KailashDN commented 3 years ago

Hi, I am using a cross-encoder for question-answer re-ranking. In a cross-encoder, we pass the question and answer together. So do both together need to fit within the 512-token limit, or do we encode the question and answer separately, each with its own 512-token limit?

My understanding: `[CLS] question [SEP] answer [PAD][PAD]....[SEP]`, or in my case `[CLS] question [SEP] topic title [SEP] answer [PAD][PAD]....[SEP]`.

Should both together be within 512? And is there a way to encode more than 512 tokens, apart from the sliding-window approach?

Thank you.

nreimers commented 3 years ago

The combined input length can be at most 512 word pieces.

Encoding more is only possible with models that were trained for more than 512 word pieces, but those then have a limit of e.g. 1024 or 4096.

Or you use a sliding-window approach.
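
To make the shared budget concrete, here is a minimal sketch of how a question-answer pair is tokenized into a single sequence (the model name is illustrative; any cross-encoder checkpoint with a standard Hugging Face tokenizer behaves the same way):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any cross-encoder tokenizer behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")

question = "What is the capital of France?"
answer = "Paris is the capital and largest city of France."

# Question and answer are encoded into ONE sequence:
# [CLS] question [SEP] answer [SEP], then truncated/padded to 512 in total.
encoded = tokenizer(
    question,
    answer,
    truncation=True,   # truncates the combined pair, not each text separately
    max_length=512,
    padding="max_length",
)
print(len(encoded["input_ids"]))  # 512 -- one shared budget, not 512 each
```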

seahrh commented 3 years ago

What is the max_seq_length that was used to train the models that produced the results shown in the CE benchmark?

The max_seq_length is only indicated for the sentence-embedding models:

> The following models have been tuned to embed sentences and short paragraphs up to a length of 128 word pieces.
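
For reference, a model's word-piece limit can also be read off the loaded model itself; a minimal sketch (the model name is illustrative, not necessarily one from the benchmark):

```python
from sentence_transformers import SentenceTransformer

# Sketch: inspect the word-piece limit of a sentence-embedding model.
model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

# Inputs longer than this are truncated when encoding.
print(model.max_seq_length)  # e.g. 128 for the short-paragraph models quoted above
```
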
nreimers commented 3 years ago

@seahrh The Cross-Encoders were trained with 512 word pieces: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_cross-encoder-v2.py
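
For context, the linked script passes this limit when constructing the model; a minimal sketch under that assumption (the base model name is illustrative):

```python
from sentence_transformers import CrossEncoder

# Sketch: a cross-encoder capped at 512 word pieces per (query, passage) pair,
# mirroring the max_length setting used in the linked training script.
model = CrossEncoder("distilroberta-base", num_labels=1, max_length=512)

# At inference, each pair is jointly tokenized and truncated to 512 word pieces.
scores = model.predict([
    ("what is the capital of france", "Paris is the capital of France."),
])
print(scores)
```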