facebookresearch / InferSent

InferSent sentence embeddings

Interesting sentence similarity scores #54

Closed: ajayrfhp closed this 6 years ago

ajayrfhp commented 6 years ago
import nltk  # used by InferSent's tokenizer
import torch
import numpy as np

# Load the pretrained AllNLI InferSent model and point it at GloVe vectors.
sentence_model = torch.load('infersent.allnli.pickle')
GLOVE_PATH = '../dataset/GloVe/glove.840B.300d.txt'
print('loaded infersent')

sentence_model.set_glove_path(GLOVE_PATH)
sentence_model.build_vocab_k_words(K=100000)  # vocab = 100k most frequent words

def similarity(sentence_model, s1, s2):
    # Cosine similarity between the two sentence embeddings.
    v1 = sentence_embed(sentence_model, s1)
    v2 = sentence_embed(sentence_model, s2)
    return cosine(v1, v2)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def sentence_embed(sentence_model, sentence):
    # encode() takes a list of sentences; grab the single embedding.
    return sentence_model.encode([sentence])[0]

similarity(sentence_model, "I do not like you", "I love you")  # => 0.66

Wondering if there is a bug in my code or if this score is expected.
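
For context, one way to sanity-check the pipeline is to score a few extra pairs with the same helper and see whether the ordering looks sensible (the probe sentences below are made-up examples, not from this issue):

# Hypothetical probe pairs: a near-paraphrase should score above the
# contradictory pair, which should score above an unrelated pair.
pairs = [
    ("I love you", "I adore you"),
    ("I do not like you", "I love you"),
    ("I love you", "The train leaves at noon"),
]
for s1, s2 in pairs:
    print(s1, "|", s2, "->", similarity(sentence_model, s1, s2))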

gitathrun commented 6 years ago

Sounds about right, especially in a couple's conversation.

aconneau commented 6 years ago

This sounds reasonable to me. Note that the classifier is not learned on top of the cosine similarity alone but on top of [u, v, |u-v|, u*v], so even though the cosine similarity is high for these contradictory sentences, the classifier can still recognize the pair as a contradiction.
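
For illustration, a minimal sketch of that pairwise feature vector, reusing the helpers from the code above (the pair_features name is mine; the actual NLI classifier is trained in the repo's training code, not here):

import numpy as np

def pair_features(u, v):
    # Features the NLI classifier sees: both embeddings, their absolute
    # difference, and their elementwise product. Negation shifts |u - v|
    # and u * v even when cos(u, v) stays high.
    return np.concatenate([u, v, np.abs(u - v), u * v])

u = sentence_embed(sentence_model, "I do not like you")
v = sentence_embed(sentence_model, "I love you")
feats = pair_features(u, v)  # 4 * 4096 dims for the default InferSent encoder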

davidjb99 commented 6 years ago

I've run some tests comparing InferSent to averaging the word2vec vectors of the words in a sentence, and I'd say InferSent is superior, although it is also far slower than w2v.

"I do not like you", "I love you"

w2V: 0.908767
InferSent : 0.668501
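
For reference, a sketch of the w2v baseline described here, assuming a gensim word2vec model is available; the vector file path and helper name are placeholders:

import numpy as np
from gensim.models import KeyedVectors

# Placeholder path: any word2vec-format vector file will do.
w2v = KeyedVectors.load_word2vec_format('w2v_vectors.bin', binary=True)

def avg_w2v_embed(sentence):
    # Average the vectors of in-vocabulary tokens. This ignores word
    # order entirely, which is why "do not like" vs. "love" can still
    # score ~0.9: the shared words dominate the mean.
    vecs = [w2v[w] for w in sentence.split() if w in w2v]
    return np.mean(vecs, axis=0)

cosine(avg_w2v_embed("I do not like you"), avg_w2v_embed("I love you"))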
ksashok commented 5 years ago

The same code with the new infersent1.pkl file returns a cosine similarity of 1.0. Can someone please confirm and let me know what I should do to correct it?