> Is there a specific technical reason for this limitation, or is it more a matter of these models not being as widely adopted or supported?
I believe it's the latter: a decoder is pretty separate from the encoder, so it should essentially always be possible to add one.
> Do you have any general advice for someone working on fine-tuning models for both STS and retrieval tasks? Are there any common pitfalls I should watch out for, or any resources you'd recommend for optimizing performance across these different tasks?
I believe most of the large model authors use "query prefixing" for the retrieval query texts, i.e. they add some prompt like `query: `, `Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: `, or `Represent this sentence for searching relevant passages: `, like here, here or here.
Usually, the passages are not prefixed, and the STS texts are not prefixed. The idea is that with STS, texts with similar meanings are pushed together, whereas with retrieval, a text is pushed together with the question it answers. These two objectives can conflict: e.g. "Who founded Apple?" and "Steve Jobs, Steve Wozniak, and Ronald Wayne" are not semantically similar, but they should be close for retrieval. The prefixed query (e.g. "query: Who founded Apple?") should (in theory) be placed near "Steve Jobs, Steve Wozniak, and Ronald Wayne", while the unprefixed "Who founded Apple?" might be placed near "Who founded NVIDIA?" and "Who founded Facebook?" via the STS learning.
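For illustration, here's a minimal sketch of applying such a prefix only to the retrieval queries at inference time, assuming sentence-transformers v2.4+ (for the `prompt` argument of `encode`), a placeholder model name, and toy texts:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder model name; substitute your own (fine-tuned) embedding model.
model = SentenceTransformer("your-finetuned-model")

queries = ["Who founded Apple?"]
passages = ["Steve Jobs, Steve Wozniak, and Ronald Wayne founded Apple in 1976."]

# Only the retrieval queries get the prefix; passages (and STS texts) are encoded as-is.
query_embeddings = model.encode(queries, prompt="query: ")
passage_embeddings = model.encode(passages)

print(util.cos_sim(query_embeddings, passage_embeddings))
```

The same prefix (or `prompt_name` if you register prompts on the model) must then be used consistently during training and at inference, otherwise the query embeddings end up in the "STS region" of the space rather than the "retrieval region".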
One other thing: for STS learning, you can also adopt `MultipleNegativesSymmetricRankingLoss`. This is like the "normal" `MultipleNegativesRankingLoss` with in-batch negatives, but given (anchor, positive) pairs, MNRL only improves "given the anchor, find the positive", whereas the Symmetric variant also improves "given the positive, find the anchor". Because STS is a symmetric task, it can make sense to also train it that way, using all other anchors as the "in-batch negatives".
The only downside is that there is no Cached variant of this loss, nor a GIST variant.
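As a rough sketch (assuming sentence-transformers v3+, a hypothetical base model, and made-up (anchor, positive) pairs purely for illustration), training with this loss could look like:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesSymmetricRankingLoss

# Hypothetical base model; any encoder you want to fine-tune works here.
model = SentenceTransformer("microsoft/mpnet-base")

# Toy (anchor, positive) STS-style pairs, just to show the expected data format.
train_dataset = Dataset.from_dict({
    "anchor": ["A man is playing a guitar.", "A child is riding a horse."],
    "positive": ["Someone plays the guitar.", "A kid rides a horse."],
})

# Symmetric in-batch negatives: trains both anchor -> positive and positive -> anchor.
loss = MultipleNegativesSymmetricRankingLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```

In practice you would mix this with a retrieval dataset (with prefixed queries) trained via `MultipleNegativesRankingLoss`, so the model sees both objectives.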
Originally posted by @tomaarsen in https://github.com/UKPLab/sentence-transformers/issues/2771#issuecomment-2185859272