SeanLee97 / AnglE

Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard
https://arxiv.org/abs/2309.12871
MIT License

UAE - explanation of Non-Retrieval vs Retrieval #8

Closed · legolego closed this issue 7 months ago

legolego commented 7 months ago

Hello, could you please add a little explanation of the difference between Non-Retrieval and Retrieval tasks for UAE? Why would one be used instead of the other? I'm looking to create sentence embeddings to store in a database. Thank you!

SeanLee97 commented 7 months ago

Hi @legolego , thanks for following our work.

In UAE, we use different approaches for retrieval and non-retrieval tasks, each serving a different purpose. Retrieval tasks aim to find relevant documents, so a query and its related documents may not be strictly semantically similar to each other.

For instance, when querying "How about chatgpt?", the related documents should contain information pertaining to "chatgpt", such as "chatgpt is amazing..." or "chatgpt is bad...".

Conversely, non-retrieval tasks, such as semantic textual similarity, require sentences that are semantically similar. For example, a sentence semantically similar to "How about chatgpt?" could be "What is your opinion about chatgpt?".

To distinguish between these two types of tasks, we use different prompts. For retrieval tasks, we use the prompt "Represent this sentence for searching relevant passages: {text}" (Prompts.C in angle_emb). For non-retrieval tasks, we set the prompt to empty, i.e., just input your text without specifying a prompt.

So, if your scenario is retrieval-related, it is highly recommended to set the prompt with angle.set_prompt(prompt=Prompts.C). If not, leave the prompt empty or use angle.set_prompt(prompt=None).
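Here is a minimal sketch of both modes, following the usage above. The WhereIsAI/UAE-Large-V1 checkpoint and the .cuda() call are assumptions taken from the README, not something specific to this issue; adjust them to your setup.

```python
# Minimal sketch: retrieval vs. non-retrieval encoding with angle_emb.
# Assumptions: the WhereIsAI/UAE-Large-V1 checkpoint and a CUDA device.
from angle_emb import AnglE, Prompts

angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1',
                              pooling_strategy='cls').cuda()

# Retrieval: apply Prompts.C to the *query*; the passages themselves
# are encoded without the prompt.
angle.set_prompt(prompt=Prompts.C)
query_vec = angle.encode({'text': 'How about chatgpt?'}, to_numpy=True)

angle.set_prompt(prompt=None)
passage_vecs = angle.encode([
    'chatgpt is amazing...',
    'chatgpt is bad...',
], to_numpy=True)

# Non-retrieval (e.g., STS): no prompt on either side.
sts_vecs = angle.encode([
    'How about chatgpt?',
    'What is your opinion about chatgpt?',
], to_numpy=True)
```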

legolego commented 7 months ago

Thank you for replying, that makes it clearer. I would like to experiment with addition and subtraction of sentence embeddings, something like KING - MAN + WOMAN = QUEEN, but for combinations of ideas in sentences. The goal would be to find sentences similar in meaning to the result of the arithmetic. Would this be a non-retrieval task because semantic similarity is important?

SeanLee97 commented 7 months ago

> Thank you for replying, that makes it clearer. I would like to experiment with addition and subtraction of sentence embeddings, something like KING - MAN + WOMAN = QUEEN, but for combinations of ideas in sentences. The goal would be to find sentences similar in meaning to the result of the arithmetic. Would this be a non-retrieval task because semantic similarity is important?

@legolego That is a good idea, and very interesting! You can try the non-retrieval embeddings. Because our pretraining data does not include arithmetic datasets, we cannot guarantee good performance on arithmetic similarity; if the results are less than expected, you can fine-tune the model on an arithmetic dataset. We provide a friendly interface for fine-tuning.
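If you want to try it, here is a rough sketch of that experiment with non-retrieval embeddings. The sentences below are toy examples, and the checkpoint and .cuda() call are assumptions; whether the analogy actually holds for your data is an empirical question.

```python
# Rough sketch: sentence-level embedding arithmetic with non-retrieval
# embeddings (no prompt set), ranking candidates by cosine similarity.
import numpy as np
from angle_emb import AnglE

angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1',
                              pooling_strategy='cls').cuda()

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# a - b + c, in the spirit of KING - MAN + WOMAN = QUEEN.
a, b, c = angle.encode([
    'The king rules the country.',
    'The man walks down the street.',
    'The woman walks down the street.',
], to_numpy=True)
target = a - b + c

candidates = [
    'The queen rules the country.',
    'The king walks down the street.',
    'A dog runs in the park.',
]
for text, vec in zip(candidates, angle.encode(candidates, to_numpy=True)):
    print(f'{cos_sim(target, vec):.3f}  {text}')
```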

legolego commented 7 months ago

Thank you for confirming! Do you have an example of a semantic arithmetic dataset like that? I've never heard of one. Searching Google gave me results about arithmetic with numbers, but not arithmetic with the ideas in sentences.

SeanLee97 commented 7 months ago

> Thank you for confirming! Do you have an example of a semantic arithmetic dataset like that? I've never heard of one. Searching Google gave me results about arithmetic with numbers, but not arithmetic with the ideas in sentences.

Sorry, I do not know much about arithmetic semantics; I mainly focus on textual similarity.

legolego commented 7 months ago

Thank you for your answers!

pedrojrv commented 5 months ago

Hi! I know this issue is closed, but related to non-retrieval vs. retrieval: how was this handled during training? When providing positive and negative pairs, did you add Prompts.C at some point? Thanks in advance.