UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Should the cosine similarity prediction score be close to the normalized label value? #506

Open · aleversn opened this issue 3 years ago

aleversn commented 3 years ago

Hi, I am working on scoring subjective answers. And now I'm wondering whether the textual similarity score should be close to the normalized label value.

I tried this on the STS dataset, and here is my example:

import random
import torch
from tqdm import tqdm
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

class DataPre():

    def __init__(self, file_name, padding_length=128, shuffle=True):
        self.padding_length = padding_length
        self.ori_list = self.load_train(file_name)
        if shuffle:
            random.shuffle(self.ori_list)

    def load_train(self, file_name):
        # Read the tab-separated STSb file, dropping a trailing empty line.
        with open(file_name, encoding='utf-8') as f:
            ori_list = f.read().split('\n')
        if ori_list[-1] == '':
            ori_list = ori_list[:-1]
        return ori_list

    def __getitem__(self, idx):
        line = self.ori_list[idx].strip().split('\t')
        # Columns 5 and 6 hold the sentence pair, column 4 the gold score (0-5).
        s1, s2, label = line[5], line[6], line[4]
        # Normalize the gold score to [0, 1] for CosineSimilarityLoss.
        return InputExample(texts=[s1, s2], label=float(label) / 5)

    def __len__(self):
        return len(self.ori_list)

mydata = DataPre('./dataset/stsbenchmark/sts-train.csv')
mydata_eval = DataPre('./dataset/stsbenchmark/sts-dev.csv')

e_s1 = []
e_s2 = []
e_score = []
for i in tqdm(range(len(mydata_eval))):
    example = mydata_eval[i]
    e_s1.append(example.texts[0])
    e_s2.append(example.texts[1])
    e_score.append(example.label)

train_examples = [mydata[i] for i in tqdm(range(len(mydata)))]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=64)

evaluator = EmbeddingSimilarityEvaluator(e_s1, e_s2, e_score, show_progress_bar=True)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], evaluator=evaluator, epochs=4, warmup_steps=100, output_path='./log/sbert')
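A hedged side note on the evaluator, assuming the sentence-transformers API as used above: EmbeddingSimilarityEvaluator scores the model by the correlation (Spearman/Pearson) between the predicted cosine similarities and the gold labels, not by their absolute distance, and it can be called directly to inspect that score.

# Returns the evaluator's main score (a correlation, not a distance).
dev_score = evaluator(model, output_path='./log/sbert')
print(dev_score)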

# %%
from sentence_transformers import util
# Compute embeddings for both lists
embeddings1 = model.encode(e_s1, convert_to_tensor=True)
embeddings2 = model.encode(e_s2, convert_to_tensor=True)

# Compute cosine similarities. util.pytorch_cos_sim returns the full N x N
# similarity matrix over all sentence combinations; only its diagonal holds
# the scores of the aligned pairs, so take the diagonal before comparing.
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2).diagonal()
average_distance = torch.abs(cosine_scores - torch.tensor(e_score, device=cosine_scores.device)).mean()

average_distance is the mean absolute difference between the prediction scores and the gold scores.

I've normalized the label scores to 0 ... 1, and the average_distance is rather high (about 0.4), whether I fine-tune the model or use the pre-trained model directly. Can the model be used for scoring the answers, or is there a mistake I've made?
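A minimal sketch of that check, reusing cosine_scores and e_score from above: it reports the mean absolute error alongside the Spearman rank correlation, the metric STS models are usually judged by, which ignores any offset or rescaling between cosine scores and labels.

from scipy.stats import spearmanr

labels = torch.tensor(e_score, device=cosine_scores.device)
mae = torch.abs(cosine_scores - labels).mean().item()
# Spearman correlation only compares rankings, so it can be high
# even when the absolute distance to the labels is large.
rho, _ = spearmanr(cosine_scores.cpu().numpy(), e_score)
print(f'MAE: {mae:.3f}  Spearman: {rho:.3f}')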

nreimers commented 3 years ago

And now I'm wondering whether the textual similarity score should be close to the normalized label value.

Usually they should be close. But depending on what your data looks like, this is not necessarily possible.

aleversn commented 3 years ago

Can I understand it this way: the model can calculate how similar two sentences are, but the score is not necessarily close to the label value?

nreimers commented 3 years ago

Yes, if you have many extreme values (like either 0 or 1), the model will also learn values in between.

For STSb, you can take the pre-trained models here and compute their distance to the gold labels.
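A compact sketch of that check, assuming the dev pairs and normalized labels are already in e_s1, e_s2, e_score as above:

import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model_large = SentenceTransformer('bert-large-nli-stsb-mean-tokens')
emb1 = model_large.encode(e_s1, convert_to_tensor=True)
emb2 = model_large.encode(e_s2, convert_to_tensor=True)
# F.cosine_similarity computes one score per aligned row (sentence pair).
pred = F.cosine_similarity(emb1, emb2)
gold = torch.tensor(e_score, device=pred.device)
print(torch.abs(pred - gold).mean().item())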

aleversn commented 3 years ago

Right now I'm testing on the STS benchmark dev set with the pre-trained bert-large-nli-stsb-mean-tokens model, and the distance is computed by

average_distance = torch.abs(cosine_scores - torch.tensor(e_score).cuda()).mean()

where cosine_scores are the prediction scores and e_score are the normalized label values. The resulting average_distance is 0.43, which seems a bit high. Is there another way to recover scores that are closer to the label values?
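One possible approach, offered as an assumption rather than anything suggested in the thread: since these models are selected by rank correlation, their raw cosine scores need not sit on the label scale, and a monotonic mapping fitted on held-out pairs can bring them closer to the labels without changing the ranking. A sketch using scikit-learn's IsotonicRegression, reusing cosine_scores and e_score from above:

import numpy as np
from sklearn.isotonic import IsotonicRegression

pred = cosine_scores.cpu().numpy()   # paired cosine scores (1-D)
gold = np.asarray(e_score)           # normalized labels in [0, 1]

# Fit a monotonic map from cosine scores to labels. For brevity this fits and
# evaluates on the same split; in practice, fit on one split, apply to another.
calibrator = IsotonicRegression(out_of_bounds='clip')
calibrated = calibrator.fit_transform(pred, gold)
print(np.abs(calibrated - gold).mean())   # typically well below the raw MAE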