UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Fine-tune on my own custom dataset #1303

Open ruinianxu opened 2 years ago

ruinianxu commented 2 years ago

Hi, first of all, thank you so much for sharing such great work with us. I am trying to train a semantic textual similarity model on my own dataset, which consists of sentence pairs of robotic task descriptions. For example, sentence 1 is "grasp the knife and cut the banana" while sentence 2 is "grasp the knife and slice the banana".

The issue I am facing right now is that both the Pearson and Spearman values are close to 0 after training. I tried several things, e.g., adjusting the learning rate, but none of them helped. I would really appreciate it if you could give me some directions to explore.

nreimers commented 2 years ago

I would assume that you are passing the training data incorrectly, using the wrong loss function, or that the labeled data does not match what you are trying to do.

ruinianxu commented 2 years ago

@nreimers Thank you so much for your quick response. I will take a look at it.

ruinianxu commented 2 years ago

@nreimers Here is the code I used for training, which is mainly based on train_stsbenchmark.py. The two are mostly the same, but I add a small variance to the score of each sentence pair, since otherwise Spearman and Pearson would be NaN. I can't find anything suspicious.

I have also attached a small sample of my data. My goal is to force the network to produce similar text embeddings when sentences refer to the same type of robotic task, e.g., cutting or cooking. The objects that appear in a sentence pair can differ. Do you think this could be the reason the network can't learn well? Thank you so much for your help.

from torch.utils.data import DataLoader
import math
from sentence_transformers import SentenceTransformer,  LoggingHandler, losses, models, util
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample
import logging
from datetime import datetime
import sys
import os
import gzip
import csv
import numpy as np

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout

# Path to the custom dataset (tab-separated, with sentence1, sentence2 and split columns)
dataset_path = '/home/ruinian/IVALab/Project/TaskGrounding/sentence-transformers/datasets/gt_rt_exonly_small_sample.csv'

#You can specify any huggingface/transformers pre-trained model here, for example, bert-base-uncased, roberta-base, xlm-roberta-base
model_name = sys.argv[1] if len(sys.argv) > 1 else 'distilbert-base-uncased'

# Training settings and output path
train_batch_size = 16
num_epochs = 4
model_save_path = 'output/training_stsbenchmark_'+model_name.replace("/", "-")+'-'+datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# Use Huggingface/transformers model (like BERT, RoBERTa, XLNet, XLM-R) for mapping tokens to embeddings
word_embedding_model = models.Transformer(model_name)

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Convert the dataset to a DataLoader ready for training
logging.info("Read GT RT train dataset")

train_samples = []
dev_samples = []
test_samples = []
with open(dataset_path, newline='') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        label = np.random.uniform(low=0.99, high=1.0)
        inp_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=label)

        if row['split'] == 'dev':
            dev_samples.append(inp_example)
        elif row['split'] == 'test':
            test_samples.append(inp_example)
        else:
            train_samples.append(inp_example)

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)
train_loss = losses.CosineSimilarityLoss(model=model)

logging.info("Read GT RT dev dataset")
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')

# Configure the training
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1) #10% of train data for warm-up
logging.info("Warmup-steps: {}".format(warmup_steps))

# Train the model
model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=evaluator,
          epochs=num_epochs,
          evaluation_steps=5000,
          warmup_steps=warmup_steps,
          output_path=model_save_path)

##############################################################################
#
# Load the stored model and evaluate its performance on the test split
#
##############################################################################
model = SentenceTransformer(model_save_path)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path=model_save_path)

Sentence 1 is Use the knife to slice the potato.
Sentence 2 is Cut the lettuce.
Sentence 1 is Carve the apple.
Sentence 2 is Cut the potato.
Sentence 1 is Pick the knife and slice the potato.
Sentence 2 is Grasp the knife and carve the tomato.
Sentence 1 is Slash the apple.
Sentence 2 is Grasp the butterknife and slash the lettuce.
Sentence 1 is Grab the butterknife and carve the tomato.
Sentence 2 is Catch the knife and slash the bread.
Sentence 1 is Grasp the butterknife and carve the lettuce.
Sentence 2 is Slice the bread by the knife.

nreimers commented 2 years ago

You just have positive pairs. For cosine similarity loss you also need negative pairs and pairs with labels in between.

Check out MultipleNegativesRankingLoss; it might be more suitable for your task.
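
To illustrate why labels like the ones in the script above give NaN or near-zero correlations, here is a minimal sketch in plain Python. It does not use the library: `pearson` is a hand-written helper, and `scores` are made-up stand-ins for the model's cosine similarities, not values from the thread.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns NaN when either input has zero variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
n = 10_000

# Made-up stand-ins for the model's cosine similarities on the dev pairs.
scores = [rng.uniform(0.0, 1.0) for _ in range(n)]

# Case 1: every pair labeled exactly 1.0, so the labels have zero variance
# and the correlation is undefined.
constant_labels = [1.0] * n
r_constant = pearson(scores, constant_labels)

# Case 2: labels drawn from uniform(0.99, 1.0), independent of the sentences.
# The evaluator is then asked to correlate the scores with pure noise.
noisy_labels = [rng.uniform(0.99, 1.0) for _ in range(n)]
r_noisy = pearson(scores, noisy_labels)

print(r_constant)  # nan
print(r_noisy)     # expected to be close to 0
```

The added noise removes the NaN but carries no information about the sentences, so there is nothing for the model to rank correctly.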

ruinianxu commented 2 years ago

@nreimers I also observed this issue. Thank you so much for your suggestion and I will take a look at it.

ruinianxu commented 2 years ago

@nreimers Sorry to keep bothering you. Thank you so much for your previous suggestion; using both positive and negative samples did work. I am still using CosineSimilarityLoss, since MultipleNegativesRankingLoss requires that, for the i-th sentence, all other sentences in the batch be negatives, which is not the case for my data.

My current issue is that no matter how I change the learning rate, the Spearman score stays the same from the first epoch onward, while Pearson changes only slightly. I am not sure whether this is normal and, if not, where the problem is. Could you give me some directions? Thank you so much for your time and kind help.
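
The concern about in-batch negatives can be made concrete. MultipleNegativesRankingLoss takes (anchor, positive) pairs and, for each anchor, treats the positives of the other pairs in the same batch as negatives. The sketch below is plain Python with made-up task labels and a hypothetical helper, `count_false_negatives`; it counts how many of those implicit negatives actually belong to the same task.

```python
def count_false_negatives(batch):
    """Count implicit in-batch negatives that share the anchor's task.

    batch: list of (anchor, positive, task) triples, where task is a
    hypothetical label such as "cut". Returns (false_negatives, total).
    """
    false_negatives = 0
    total = 0
    for i, (_, _, task_i) in enumerate(batch):
        for j, (_, _, task_j) in enumerate(batch):
            if i == j:
                continue  # a pair's own positive is not a negative
            total += 1
            if task_i == task_j:
                false_negatives += 1
    return false_negatives, total


# Illustrative batch; the "cook" and "pour" sentences are invented.
batch = [
    ("Grasp the knife and cut the banana.", "Slice the banana.", "cut"),
    ("Carve the apple.", "Cut the potato.", "cut"),
    ("Cook the egg.", "Fry the egg in the pan.", "cook"),
    ("Pour the water into the cup.", "Fill the cup with water.", "pour"),
]

print(count_false_negatives(batch))  # (2, 12)
```

With only a handful of task types, most batches would contain several pairs from the same task, so many implicit negatives would be wrong. One possible workaround, offered only as a suggestion, is to sample batches so that each task appears at most once.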

Iteration: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4688/4688 [05:52<00:00, 13.29it/s]
2021-12-08 16:07:40 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 0:                                                                                                  
2021-12-08 16:07:58 - Cosine-Similarity :       Pearson: 1.0000 Spearman: 0.8165                                                                                                                            
2021-12-08 16:07:58 - Manhattan-Distance:       Pearson: 0.9989 Spearman: 0.8165                                                                                                                            
2021-12-08 16:07:58 - Euclidean-Distance:       Pearson: 0.9990 Spearman: 0.8165                                                                                                                            
2021-12-08 16:07:58 - Dot-Product-Similarity:   Pearson: 0.9955 Spearman: 0.8165                                                                                                                            
2021-12-08 16:07:58 - Save model to output/training_gtrt_exim_distilbert-base-uncased-2021-12-08_16-01-43                                                                                                   
Iteration: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4688/4688 [06:00<00:00, 13.00it/s]
2021-12-08 16:13:59 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 1:                                                                                                  
2021-12-08 16:14:16 - Cosine-Similarity :       Pearson: 1.0000 Spearman: 0.8165                                                                                                                            
2021-12-08 16:14:16 - Manhattan-Distance:       Pearson: 0.9900 Spearman: 0.8165                                                                                                                            
2021-12-08 16:14:16 - Euclidean-Distance:       Pearson: 0.9924 Spearman: 0.8165                                                                                                                            
2021-12-08 16:14:16 - Dot-Product-Similarity:   Pearson: 0.9655 Spearman: 0.8165                                                                                                                            
Iteration: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4688/4688 [05:53<00:00, 13.27it/s]
2021-12-08 16:20:09 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 2:                                                                                                  
2021-12-08 16:20:26 - Cosine-Similarity :       Pearson: 1.0000 Spearman: 0.8165                                                                                                                            
2021-12-08 16:20:26 - Manhattan-Distance:       Pearson: 0.9916 Spearman: 0.8165                                                                                                                            
2021-12-08 16:20:26 - Euclidean-Distance:       Pearson: 0.9933 Spearman: 0.8165                                                                                                                            
2021-12-08 16:20:26 - Dot-Product-Similarity:   Pearson: 0.9680 Spearman: 0.8165                                                                                                                            
2021-12-08 16:20:26 - Save model to output/training_gtrt_exim_distilbert-base-uncased-2021-12-08_16-01-43                                                                                                   
Iteration: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4688/4688 [05:59<00:00, 13.06it/s]
2021-12-08 16:26:27 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 3:                                                                                                  
2021-12-08 16:26:45 - Cosine-Similarity :       Pearson: 1.0000 Spearman: 0.8165                                                                                                                            
2021-12-08 16:26:45 - Manhattan-Distance:       Pearson: 0.9911 Spearman: 0.8165                                                                                                                            
2021-12-08 16:26:45 - Euclidean-Distance:       Pearson: 0.9928 Spearman: 0.8165                                                                                                                            
2021-12-08 16:26:45 - Dot-Product-Similarity:   Pearson: 0.9656 Spearman: 0.8165                                                                                                                            
2021-12-08 16:26:45 - Save model to output/training_gtrt_exim_distilbert-base-uncased-2021-12-08_16-01-43                                                                                                   
Epoch: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [25:02<00:00, 375.58s/it]
2021-12-08 16:26:50 - Load pretrained SentenceTransformer: output/training_gtrt_exim_distilbert-base-uncased-2021-12-08_16-01-43
2021-12-08 16:26:50 - Use pytorch device: cuda
2021-12-08 16:26:50 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2021-12-08 16:27:02 - Cosine-Similarity :   Pearson: 1.0000 Spearman: 0.8660
2021-12-08 16:27:02 - Manhattan-Distance:   Pearson: 0.9940 Spearman: 0.8660
2021-12-08 16:27:02 - Euclidean-Distance:   Pearson: 0.9952 Spearman: 0.8660
2021-12-08 16:27:02 - Dot-Product-Similarity:   Pearson: 0.9549 Spearman: 0.8660
ruinianxu commented 2 years ago

I even tried different generated datasets, but the Spearman value still stays the same. I am really confused and would really appreciate any hints.

nreimers commented 2 years ago

A Pearson correlation of 1 indicates that your evaluation data is trivial, i.e., it is trivial to estimate the label for a given pair.
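
One possible explanation for the frozen Spearman values, offered as a hedged sketch rather than a confirmed diagnosis: if the labels take only two values and the model separates the two groups perfectly, Spearman correlation (with tied labels given average ranks) tops out at roughly sqrt(3p(1-p)), where p is the fraction of positive pairs. The helpers below are hand-written plain Python, and the class proportions and scores are assumptions, not figures from the thread.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def separated_binary_spearman(n_neg, n_pos):
    """Spearman for binary labels when every positive outscores every negative."""
    labels = [0.0] * n_neg + [1.0] * n_pos
    # Hypothetical, tie-free scores: negatives in [0.1, 0.3], positives above 0.8.
    scores = [0.1 + 0.001 * i for i in range(n_neg)]
    scores += [0.8 + 0.0001 * i for i in range(n_pos)]
    return spearman(labels, scores)


print(round(separated_binary_spearman(150, 150), 4))  # 0.866
print(round(separated_binary_spearman(200, 100), 4))  # 0.8165
```

Under these assumed proportions, the ceiling is about 0.8660 for a 1:1 split and about 0.8165 for a 1:2 split, which happen to coincide with the test and dev values in the logs above. If that reading is right, the unchanged Spearman score would reflect ties in the labels rather than a lack of learning.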

ruinianxu commented 2 years ago

@nreimers Thank you so much for your hints. First of all, let me briefly describe the format of my dataset. It contains task descriptions for several robotic tasks, e.g., cutting. Each task description includes at most two objects (subject and object) and one action. A task description can be as explicit as "grasp the knife and cut the banana" or as implicit as "I want sliced banana". I am trying to force the network to predict similar embeddings for explicit and implicit descriptions. All descriptions are generated from a list of templates, and only the objects mentioned in a template are replaced with different names.

My previous dataset assigned the highest score to every pair of sentences belonging to the same robotic task and zero to all other pairs, which I think is what made the dataset trivial. What I tried today is to rank sentence pairs into different levels. For example, if both objects mentioned in the descriptions match, the pair is assigned 5.0. If only some of the objects match, it is assigned a lower score. This improved things a little, but not much. As the results below show, the Pearson correlation is still close to 1.

My current plan is to use paraphrasers to add a degree of uncertainty to the sentences.

I wonder whether there are any other ways to improve my dataset, such as AugmentedSBERT. I would really appreciate your suggestions.
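
A graded labeling scheme of this kind could be sketched as follows. This is plain Python under stated assumptions: the `Description` class, the `pair_score` helper, and the intermediate score values (3.0 to 5.0 depending on object overlap) are all hypothetical, since the thread only specifies 5.0 for a full object match and 0 for different tasks. The division by 5.0 mirrors how train_stsbenchmark.py normalizes STS scores to the 0 to 1 range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Description:
    text: str
    task: str           # hypothetical task label, e.g. "cut"
    objects: frozenset  # names of the objects mentioned in the text


def pair_score(a, b):
    """Hypothetical 0-5 score: 0 for different tasks, otherwise 3 plus up to
    2 points for the fraction of objects the two descriptions share."""
    if a.task != b.task:
        return 0.0
    union = a.objects | b.objects
    overlap = len(a.objects & b.objects) / len(union) if union else 1.0
    return 3.0 + 2.0 * overlap


def cosine_label(a, b):
    """Scale the 0-5 score to 0-1 for use as a similarity label."""
    return pair_score(a, b) / 5.0


cut_banana = Description("Grasp the knife and cut the banana.", "cut",
                         frozenset({"knife", "banana"}))
slice_banana = Description("Grasp the knife and slice the banana.", "cut",
                           frozenset({"knife", "banana"}))
carve_tomato = Description("Grasp the knife and carve the tomato.", "cut",
                           frozenset({"knife", "tomato"}))
cook_egg = Description("Cook the egg.", "cook", frozenset({"egg"}))  # invented

print(pair_score(cut_banana, slice_banana))  # 5.0
print(pair_score(cut_banana, carve_tomato))  # between 3.0 and 5.0
print(pair_score(cut_banana, cook_egg))      # 0.0
```

Because every label here is a deterministic function of the template and object names, a dataset built this way may still be easy for the model to fit.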

Iteration: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4995/4995 [06:05<00:00, 13.67it/s]
2021-12-09 18:01:48 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 0:                                                                                                  
2021-12-09 18:01:59 - Cosine-Similarity :       Pearson: 0.9950 Spearman: 0.8891                                                                                                                            
2021-12-09 18:01:59 - Manhattan-Distance:       Pearson: 0.9540 Spearman: 0.8891                                                                                                                            
2021-12-09 18:01:59 - Euclidean-Distance:       Pearson: 0.9541 Spearman: 0.8891                                                                                                                            
2021-12-09 18:01:59 - Dot-Product-Similarity:   Pearson: 0.9893 Spearman: 0.8891                                                                                                                            
2021-12-09 18:01:59 - Save model to output/training_gtrt_exim_distilbert-base-uncased-2021-12-09_17-55-38                                                                                                   
Iteration: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4995/4995 [06:10<00:00, 13.50it/s]
2021-12-09 18:08:10 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 1:                                                                                                  
2021-12-09 18:08:21 - Cosine-Similarity :       Pearson: 0.9980 Spearman: 0.8891                                                                                                                            
2021-12-09 18:08:21 - Manhattan-Distance:       Pearson: 0.9505 Spearman: 0.8891                                                                                                                            
2021-12-09 18:08:21 - Euclidean-Distance:       Pearson: 0.9500 Spearman: 0.8891                                                                                                                            
2021-12-09 18:08:21 - Dot-Product-Similarity:   Pearson: 0.9939 Spearman: 0.8891                                                                                                                            
Iteration: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4995/4995 [06:09<00:00, 13.51it/s]
2021-12-09 18:14:31 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 2:                                                                                                  
2021-12-09 18:14:42 - Cosine-Similarity :       Pearson: 0.9991 Spearman: 0.8891                                                                                                                            
2021-12-09 18:14:42 - Manhattan-Distance:       Pearson: 0.9502 Spearman: 0.8891                                                                                                                            
2021-12-09 18:14:42 - Euclidean-Distance:       Pearson: 0.9498 Spearman: 0.8891                                                                                                                            
2021-12-09 18:14:42 - Dot-Product-Similarity:   Pearson: 0.9950 Spearman: 0.8891                                                                                                                            
2021-12-09 18:14:42 - Save model to output/training_gtrt_exim_distilbert-base-uncased-2021-12-09_17-55-38                                                                                                   
Iteration: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4995/4995 [06:10<00:00, 13.50it/s]
2021-12-09 18:20:54 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset after epoch 3:                                                                                                  
2021-12-09 18:21:05 - Cosine-Similarity :       Pearson: 0.9994 Spearman: 0.8891                                                                                                                            
2021-12-09 18:21:05 - Manhattan-Distance:       Pearson: 0.9490 Spearman: 0.8891                                                                                                                            
2021-12-09 18:21:05 - Euclidean-Distance:       Pearson: 0.9485 Spearman: 0.8891                                                                                                                            
2021-12-09 18:21:05 - Dot-Product-Similarity:   Pearson: 0.9958 Spearman: 0.8891                                                                                                                            
2021-12-09 18:21:05 - Save model to output/training_gtrt_exim_distilbert-base-uncased-2021-12-09_17-55-38                                                                                                   
Epoch: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [25:28<00:00, 382.03s/it]
2021-12-09 18:21:11 - Load pretrained SentenceTransformer: output/training_gtrt_exim_distilbert-base-uncased-2021-12-09_17-55-38
2021-12-09 18:21:11 - Use pytorch device: cuda
2021-12-09 18:21:11 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2021-12-09 18:21:22 - Cosine-Similarity :   Pearson: 0.9956 Spearman: 0.8954
2021-12-09 18:21:22 - Manhattan-Distance:   Pearson: 0.9464 Spearman: 0.8954
2021-12-09 18:21:22 - Euclidean-Distance:   Pearson: 0.9457 Spearman: 0.8954
2021-12-09 18:21:22 - Dot-Product-Similarity:   Pearson: 0.9909 Spearman: 0.8937