FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs

how to handle negation case #244

Open · bugface opened this issue 10 months ago

bugface commented 10 months ago

Hi FlagEmbedding team,

Recently, I have been testing whether bge embeddings can be used for text deduplication. One problem I have noticed is that when one sentence is the negation of the other, the similarity between them is still high. I know this is not a new problem; people have already discussed it at length: https://discuss.huggingface.co/t/sentence-similarity-models-not-capturing-opposite-sentences/10388.

My questions are:

  1. Does the team have a plan to address this kind of problem?
  2. Could the model be made to produce similarities in the range (-1, 1), so that opposite semantic meanings can be represented?
  3. Do you think sentiment analysis plus semantic similarity is a good way to solve this problem?

Thanks.

See test code below.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-base-en')

# Pair i of sentences_1 is compared against pair i of sentences_2;
# the second sentence in sentences_2 negates the question.
sentences_1 = ["Have you had a heart attack within the last year?", "Have you had a heart attack within the last year?"]
sentences_2 = ["Do you have a heart attack within the last year?", "Do you not have a heart attack within the last year?"]

# Embeddings are L2-normalized, so the dot product is cosine similarity
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T

# The diagonal holds the pairwise similarities
print(np.diagonal(similarity))

# [0.98755527, 0.9616535 ]
staoxiao commented 10 months ago
  1. Sorry, we have no plan to solve this problem.
  2. You can try version 1.5, which has a more reasonable similarity distribution; see the first sketch below.
  3. Maybe. We haven't done this, but it seems to be a possible solution; see the second sketch below.
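
For suggestion 2, here is an untested sketch of the same comparison re-run with the v1.5 model (assuming the English base checkpoint is published as BAAI/bge-base-en-v1.5):

import numpy as np
from sentence_transformers import SentenceTransformer

# Same test sentences as in the issue, re-run with the v1.5 checkpoint
model_v15 = SentenceTransformer('BAAI/bge-base-en-v1.5')

sentences_1 = ["Have you had a heart attack within the last year?", "Have you had a heart attack within the last year?"]
sentences_2 = ["Do you have a heart attack within the last year?", "Do you not have a heart attack within the last year?"]

embeddings_1 = model_v15.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model_v15.encode(sentences_2, normalize_embeddings=True)

# Pairwise similarities; v1.5 is expected to give a wider score spread
print(np.diagonal(embeddings_1 @ embeddings_2.T))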
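
For suggestion 3, a rough, untested sketch of combining sentiment polarity with embedding similarity. The default sentiment-analysis pipeline model and the 0.5 penalty are illustrative assumptions, not a tested recipe; sentiment polarity is only a crude proxy for negation:

import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

embedder = SentenceTransformer('BAAI/bge-base-en-v1.5')
sentiment = pipeline('sentiment-analysis')  # default English sentiment model (assumption)

def adjusted_similarity(a: str, b: str) -> float:
    # Cosine similarity from L2-normalized embeddings
    emb = embedder.encode([a, b], normalize_embeddings=True)
    cos = float(emb[0] @ emb[1])
    # Crude negation proxy: penalize pairs whose sentiment labels disagree
    # (0.5 is an arbitrary illustrative penalty, not a tuned value)
    label_a = sentiment(a)[0]['label']
    label_b = sentiment(b)[0]['label']
    return cos if label_a == label_b else cos - 0.5

print(adjusted_similarity(
    "Have you had a heart attack within the last year?",
    "Do you not have a heart attack within the last year?"))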