UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
14.78k stars 2.43k forks

Fine-tune with data without labels #89

Open wyzrxrs opened 4 years ago

wyzrxrs commented 4 years ago

Hello, is it possible to fine-tune the model on a dataset containing only sentences, without labels? I would like to find similar sentences using sentence-transformers.

Thanks!

nreimers commented 4 years ago

Hi @wyzrxrs You need some structure / label that encodes which sentence pairs should be similar or dissimilar. Learning this from thin air is obviously not possible; you need to somehow tell the computer what you judge as similar and what not.

Sometimes you can use some structure in your documents. For example, two sentences on the same topic (from the same document) should be more similar than two sentences from different documents. Or neighbouring sentences should be more similar than non-neighbouring sentences.

But in the end you need some information about what is treated as similar and what not.
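That heuristic can be sketched in a few lines. A minimal example (plain Python; the documents and the 1.0 / 0.0 labels are made up for illustration) that treats consecutive sentences in the same document as positives and sentences from different documents as negatives:

```python
import random

def pairs_from_documents(documents, seed=0):
    """Derive (sentence_a, sentence_b, label) training pairs from raw documents.

    label 1.0: consecutive sentences in the same document (assumed similar)
    label 0.0: sentences drawn from two different documents (assumed dissimilar)
    """
    rng = random.Random(seed)
    pairs = []
    # Positive pairs: neighbouring sentences within one document.
    for doc in documents:
        for s1, s2 in zip(doc, doc[1:]):
            pairs.append((s1, s2, 1.0))
    # Negative pairs: one random sentence from each of two different documents.
    for i, doc in enumerate(documents):
        for j, other in enumerate(documents):
            if i < j:
                pairs.append((rng.choice(doc), rng.choice(other), 0.0))
    return pairs

docs = [
    ["The cat sat on the mat.", "It purred happily."],
    ["Stocks fell sharply today.", "Investors were nervous."],
]
train_pairs = pairs_from_documents(docs)
```

These weakly labeled pairs can then be fed into whatever pair-based training objective you use; the heuristic itself is the assumption, not a ground truth.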

Best Nils Reimers

MattesR commented 4 years ago

Since I'm trying to do something similar (heh), I want to refine the question, as I too am asking about the idea of creating a semantic similarity search engine.

I want to build a semantic similarity search engine using embeddings, for multiple languages, mainly English and German in the beginning. My understanding of sentence-transformers was that semantically similar sentences automatically get similar vectors (analogous to word embeddings), and that fine-tuning on a dataset is the same as fine-tuning the underlying transformer (such as BERT). Is this not correct? I understand that using your pre-trained models to find similar sentences in a German dataset will likely yield poor results. The reason for this, I thought, is that the underlying model creates poor word representations in the first place. That's where training a different model comes into play. I just stumbled over this question and wanted to clarify things before I start training a fine-tuned BERT on my data. So isn't (semantic) similarity search with embeddings unlabeled by design, given that the embeddings capture what they are supposed to capture in the first place?

PS: I assume that OP also wants to create a semantic similarity search engine.

nreimers commented 4 years ago

Hi @MattesR 1) The typical way to create sentence embeddings (so that similar sentences are close) is to either use labels directly (e.g. hand-annotated) or to exploit some structure from which you can derive a notion of similarity. One of the most basic structures you could exploit is the distance between sentences in a document: sentences close to each other, like consecutive sentences, are usually more similar than sentences far apart in a document or from two different documents.

Luckily you don't have to start from scratch, there are already enough datasets and models available to give you vectors with the desired properties.
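Once you have such vectors, semantic search itself is just cosine similarity plus a top-k sort. A plain-Python sketch (the toy 3-dimensional vectors stand in for real embeddings, which in practice would come from a model's encode step):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, corpus_vecs, k=2):
    """Return indices of the k corpus vectors most similar to the query."""
    ranked = sorted(range(len(corpus_vecs)),
                    key=lambda i: cosine(query_vec, corpus_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 3-dim "embeddings"; real ones would be e.g. 768-dim model outputs.
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]
nearest = top_k(query, corpus)
```

This is only the retrieval arithmetic; the quality of the results is entirely determined by how the embeddings were trained.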

2) Yes, BERT / RoBERTa creates rather poor sentence representations out-of-the-box. So you must fine-tune BERT so that it creates nice sentence representations (as described in my paper). The multilingual BERT / RoBERTa models also perform rather badly out-of-the-box. For example, say you have sentences in English and Spanish and want to measure their similarity. On the STS2017 dataset, XLM-RoBERTa achieves a correlation of only 10 points, and multilingual BERT only 27 points. I.e., out-of-the-box you cannot use mBERT / XLM-RoBERTa to measure the similarity between an English and a Spanish sentence.

However, with aligned multi-lingual datasets, for example datasets you use in machine translation, you can force mBERT / XLM-RoBERTa to map sentences from different languages to the same point in vector space.

I am currently running experiments on this multilingual setup, and with some training data you can improve the performance from 10 / 27 points to over 55 points.

Currently the experiments are in a preliminary state. But once I have everything right, I will release multilingual models. These models can then also be used for semantic search.

MattesR commented 4 years ago

Hi @nreimers I've reread the paper, as I got the feeling that I had missed something. It raised further questions. I'd welcome an answer, if you have the time:

  1. When you trained SBERT with SNLI and MultiNLI, why did you use the softmax classifier instead of the triplet objective function? I would have assumed it's more accurate when you have data classified with three usable labels.

  2. The classification objective function can be used with any number of labels k in my data, correct?

  3. Luckily you don't have to start from scratch, there are already enough datasets and models available to give you vectors with the desired properties.

    Don't I? I thought there was no model that gives good sentence embeddings (usable for semantic similarity search) for German (or any non-English) text data. That's why I wanted to fine-tune Sentence-BERT for German data. As I understand it, the problem with creating an SBERT model for another language is the lack of labeled data. (Now I get that one needs labeled data for the task.)

  4. Does it make sense to fine-tune BERT with my data and afterwards fine-tune that already fine-tuned model for Sentence-BERT? My thinking is that starting with a better word representation makes fine-tuning SBERT more accurate.

  5. "I currently run experiments on this multilingual setup and with some training data you can improve the performance from 10 / 27 points to over 55 points." What is the structure of that training data, if I may ask? Do you have labeled data, or did you label automatically? How many labels does the training data have?

Thanks in advance for answers!

nreimers commented 4 years ago

Hi,

1) Triplet loss minimizes the distance between anchor and positive while maximizing the distance between anchor and negative. It would not make sense to use triplet loss for NLI, as you do not have triplets and no notion of anchor / positive / negative. You could maybe construct them from the entailment and contradiction relations. But I don't know if every hypothesis in NLI has at least one entailment and one contradiction.

2) Correct

3) Oh yes, for other languages there are sadly not many suitable datasets available. So there you need to somehow start from scratch.

4) Can make sense. For German, you can use the German BERT or the multilingual BERT.

5) Currently I train it in a multi-task setup. It will hopefully be included in the next release.

The idea is as follows: 1) You have English embeddings with the desired properties, i.e., semantically close pairs are close in vector space. 2) Identical sentences, independent of the language, are mapped to the same point in vector space. I.e., "Hello world" and "Hallo Welt" are mapped to the same point in vector space.

You achieve 1) by training on SNLI or STS data. For 2), you also train on machine translation training data, where you have parallel sentences (e.g. EuroParl or the UN corpus).

For 2), you use MultipleNegativesRankingLoss: the sentence and its translation serve as a positive pair, and all other translations in the batch serve as negative pairs.

Fine-tuning the parameters is a bit tricky, as you now have two training objectives: translations should be close in vector space, and distance should indicate the semantic similarity of pairs.
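To make the in-batch negatives idea concrete, here is a plain-Python sketch of the scoring step behind a MultipleNegativesRankingLoss-style objective (an illustration of the idea, not the library's implementation): each sentence is scored against every translation in the batch, and the loss is the cross-entropy of picking its own translation.

```python
import math

def in_batch_negatives_loss(sent_embs, trans_embs):
    """Mean cross-entropy over in-batch dot-product scores.

    sent_embs[i] and trans_embs[i] form the positive pair; every other
    trans_embs[j] in the batch acts as a negative for sent_embs[i].
    """
    total = 0.0
    n = len(sent_embs)
    for i in range(n):
        # Score sentence i against every translation in the batch.
        scores = [sum(a * b for a, b in zip(sent_embs[i], t)) for t in trans_embs]
        # Log-probability assigned to the correct (i-th) translation.
        log_prob_i = scores[i] - math.log(sum(math.exp(s) for s in scores))
        total += -log_prob_i
    return total / n

# Aligned toy pairs: each sentence embedding matches its translation exactly.
aligned = [[2.0, 0.0], [0.0, 2.0]]
loss_aligned = in_batch_negatives_loss(aligned, aligned)
# Swapping the translations misaligns the positives and raises the loss.
loss_swapped = in_batch_negatives_loss(aligned, aligned[::-1])
```

The real loss differs in details (e.g. similarity function, scaling, batched tensors), but the in-batch-negatives structure is the point here.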

subhamiitk commented 4 years ago

> Hi @wyzrxrs You need some structure / label that encodes which sentence pairs should be similar or dissimilar. Learning this from thin air is obviously not possible; you need to somehow tell the computer what you judge as similar and what not.
>
> Sometimes you can use some structure in your documents. For example, two sentences on the same topic (from the same document) should be more similar than two sentences from different documents. Or neighbouring sentences should be more similar than non-neighbouring sentences.
>
> But in the end you need some information about what is treated as similar and what not.

Hi Nils

I understand that we need to pass some structure of similarity between two sentences when fine-tuning. But let's say I want to incorporate domain knowledge about the corpus (200k unlabelled English sentences) that I am using. Can pre-training the language model using run_language_modeling.py be useful? If yes, can I still use the Sentence-Transformers English pre-trained models and then fine-tune further on my labelled dataset (around 10k sentences)?

nreimers commented 4 years ago

Hi @subham1

I haven't tested language model fine-tuning. But as far as I know, the results are rather limited. Further, you would first perform the language model fine-tuning, and then you would need to fine-tune on some data that conveys your sentence similarities.

Best Nils Reimers

boxorange commented 4 years ago

Hi @nreimers

First, thanks for the great tool. I'm trying to find the top N most similar sentences to a specific sentence in a scientific document. I have a question about labels. Based on your earlier answers, they play a role in giving the machine a hint about what similarity means between sentences. I wonder what to do if I'm not quite sure what the labels would be. I'm thinking of building a new model based on SciBERT, training it on NLI and STS data. On top of that, I'm thinking of fine-tuning it on my own data, which is a collection of scientific articles. Let's say I have 10 articles, and I want to get the top 5 most similar sentences for any sentence in an article. In this case, should the labels be the articles themselves? Or, if the labels are not clear, would it be better to just use the model without further training it on my data?

Many thanks in advance!

nreimers commented 4 years ago

Hi @boxorange In that case I think it might be sufficient just to train on NLI + STS.

Otherwise I recommend having a look at triplet loss. Your input could look like:

anchor: a sentence
positive example: a sentence that is similar to your anchor
negative example: a sentence that is NOT similar to your anchor

If you have some data like this, or if you can construct it somehow syntactically, you can train the model.
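For reference, the triplet objective itself is simple. A plain-Python sketch (toy 2-dimensional embeddings, Euclidean distance, margin chosen arbitrarily):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin).

    Zero when the negative is already at least `margin` farther
    from the anchor than the positive is; otherwise the gradient
    pulls the positive closer and pushes the negative away.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy triplet: positive close to the anchor, negative far away.
anchor, pos, neg = [0.0, 0.0], [0.1, 0.0], [5.0, 0.0]
loss = triplet_loss(anchor, pos, neg)
```

In training, the distances come from the model's current embeddings, so minimizing this loss reshapes the embedding space.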

Have a look at this paper: https://openreview.net/forum?id=rkg-mA4FDr

There, they constructed a dataset of similar (query, sentence) pairs from Wikipedia by using the structure within Wikipedia (sections and links between articles). Maybe you have some similar structure in your data that you could use.

Best Nils Reimers

pistocop commented 4 years ago

Hi @MattesR

I'm very interested in one of your points:

  1. Does it make sense to fine-tune BERT with my data and afterwards fine-tune that already fine-tuned model for Sentence-BERT? My thinking is that starting with a better word representation makes fine-tuning SBERT more accurate.

I'm wondering if you have tried this approach, and if so, what results you achieved.

Many thanks in advance for any information.

kasramsh commented 4 years ago

Dear Nils @nreimers ,

Thanks for your great work and open sourcing your codes :)

I have a question and hope I can find my answer here.

Unlike others, I actually have a problem with the implementation of fine-tuning for semantic similarity search (in German). The examples use DataLoaders with public datasets (like the STS benchmark) for this kind of tuning.

I have a large number of similar and non-similar sentence pairs in German and would like to fine-tune the model for similarity search myself (from scratch).

How can I feed the model with that data directly?
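To be concrete, my data looks roughly like this (made-up examples), and my plan was to map the binary labels onto scores that a cosine-similarity-style training objective could consume:

```python
# Made-up examples of the raw data: German sentence pairs labeled
# "similar" / "non-similar".
raw_pairs = [
    ("Das Wetter ist heute schön.", "Heute ist schönes Wetter.", "similar"),
    ("Das Wetter ist heute schön.", "Ich esse gern Pizza.", "non-similar"),
]

def to_scored_pairs(rows):
    """Map the binary labels onto the 0..1 score range that a
    cosine-similarity training objective expects."""
    label_to_score = {"similar": 1.0, "non-similar": 0.0}
    return [(a, b, label_to_score[label]) for a, b, label in rows]

train_data = to_scored_pairs(raw_pairs)
```

From there I assume these (text_a, text_b, score) triples would be wrapped into whatever input-example type and DataLoader the training API expects, but that is the part I am unsure about.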

I thank you in advance and appreciate hints in this regard.

Cheers, Kasra

thistleknot commented 1 year ago

https://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/

The method there will do it.

prtkmishra commented 1 year ago

Dear Nils @nreimers , Firstly, I would like to thank you for making your code open source and for such great documentation. I wanted to get your thoughts on my approach and understanding, if you don't mind. My use case involves creating an NLP-based multi-label classification model for a custom text dataset (system-generated logs, which are not in any human language). One of the experiments I wanted to run is with Sentence Transformers. The approach is based on domain adaptation, where I will pre-train a transformer with MLM on an unsupervised dataset and then fine-tune the classification head using the embeddings generated from MLM and labelled data. Based on what I have read, below is my understanding:

  1. pre-train a transformer (RoBERTa) from scratch on my custom dataset
  2. Use the RoBERTa checkpoints in sentence_transformer
  3. build the dataset (pairs or triplets). This is where I am not sure how to calculate the similarity label for the sentence transformer
  4. train the sentence transformer
  5. use SetFit on top of trained sentence transformer

My question is about building the sentence transformer dataset. Here I do not have a label which can be used as a similarity score. Is there a way to train a sentence transformer in an unsupervised way? And if yes, would it be beneficial in any way?