UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Question about encoding longer texts #1622

Open mina1987 opened 2 years ago

mina1987 commented 2 years ago

I am using 'sentence-transformers/all-mpnet-base-v2'. My question is: what happens when I encode a text longer than 384 tokens? Does the model embed the sentences in the longer text separately? If so, how does the model calculate the final embedding for my text? Example:

text="""We evaluate the performance of SBERT for common Semantic Textual Similarity (STS) tasks. State-of-the-art methods often learn a (complex)
regression function that maps sentence embeddings to a similarity score. However, these regression functions work pair-wise and due to the combinatorial explosion those are often not scalable if
the collection of sentences reaches a certain size.
Instead, we always use cosine-similarity to compare the similarity between two sentence embeddings. We ran our experiments also with negative Manhatten and negative Euclidean distances
as similarity measures, but the results for all approaches remained roughly the same."""

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

embedding = model.encode(text)  # returns a single 768-dimensional vector
nreimers commented 2 years ago

Any text longer than the maximum sequence length is truncated. The whole text is passed as a single input; it is not split into sentences.
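For illustration, here is a minimal sketch (not part of the original reply) showing how to check the model's token limit and how many tokens a given input produces; anything beyond the limit is dropped silently by encode:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
text = "..."  # the long passage from the question above

# The model consumes at most this many tokens per input; 384 for all-mpnet-base-v2.
print(model.max_seq_length)

# Tokenize with the model's own tokenizer to see how many tokens the text produces.
token_ids = model.tokenizer(text)["input_ids"]
print(len(token_ids))  # if this exceeds max_seq_length, the tail is ignored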

mina1987 commented 2 years ago

Thanks! So it takes the first 384 MPNet tokens and encodes those.

mina1987 commented 2 years ago

Any idea of a more suitable model than MPNet for encoding multi-sentence text?
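One common workaround, sketched here only as an example (the chunk size and the encode_long_text helper are hypothetical, not part of the library), is to split a long document into chunks that fit within the token limit, encode each chunk, and average the chunk embeddings:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

def encode_long_text(long_text, words_per_chunk=200):
    # Hypothetical helper: split on whitespace into chunks small enough to stay
    # under the 384-token limit (word-piece tokenization can expand word counts,
    # so keep the chunk size conservative), encode each chunk separately,
    # and mean-pool the per-chunk embeddings into a single vector.
    words = long_text.split()
    chunks = [" ".join(words[i:i + words_per_chunk])
              for i in range(0, len(words), words_per_chunk)]
    chunk_embeddings = model.encode(chunks)   # shape: (num_chunks, 768)
    return chunk_embeddings.mean(axis=0)      # single 768-dimensional vector

Whether mean-pooling chunk embeddings works well enough depends on the task; a model with a longer input limit may still be preferable.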