UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net

Augmented SBERT training #553

Open langineer opened 3 years ago

langineer commented 3 years ago

I'm confused by the Augmented SBERT training procedure in train_sts_indomain_semantic.py: is it possible to skip the first step completely (training a cross-encoder (BERT) model from scratch on the small STS benchmark dataset) and instead use an already fine-tuned model like ‘xlm-r-distilroberta-base-paraphrase-v1’ to fit my data? More concretely: skip step 1, do step 2 as it is done in train_sts_indomain_semantic.py, and then at step 3 use ‘xlm-r-distilroberta-base-paraphrase-v1’ as the bi-encoder.

thakur-nandan commented 3 years ago

Hi @langineer ,

The reason for training a cross-encoder in the first step is so that we can label our silver pairs better, and in-domain.

In theory, if you wish to skip step 1, you would require another model to label your silver pairs at the end of step 2. I'm unsure which model you would choose for that.

Kind Regards, Nandan Thakur

langineer commented 3 years ago

Hi @NThakur20,

Thanks for the quick reply, and for the great paper and the idea behind Augmented SBERT.

you would require another model to label your silver pairs at the end of step 2. I'm unsure which model you would choose for that.

Yes, true. I see now that we use this cross-encoder, fine-tuned on the gold dataset, at the end of step 2. But is it possible to use already pre-trained cross-encoders at step 2? For example: sentence-transformers/ce-roberta-large-stsb

langineer commented 3 years ago

@NThakur20, it seems I didn't fully get the idea of the augmentation. I was checking the main difference in training procedure between train_sts_indomain_semantic.py and train_sts_qqp_crossdomain.py. I wanted to see if I can train on unlabeled data (similar/dissimilar pairs of sentences) with train_sts_indomain_semantic.py: create soft labels (any float from 0 to 1) and then train the bi-encoder with the CosineSimilarityLoss used in train_sts_indomain_semantic.py, rather than create binary labels and train with the MultipleNegativesRankingLoss used in train_sts_qqp_crossdomain.py. But after such training the evaluation score on the STS benchmark got worse (0.79). I also tried to use 'distilroberta-base-paraphrase-v1' to create the embeddings at step 2 and as the bi-encoder to fit the data at step 3; then the score got better (0.84), but there is no improvement after the first epoch. I'm not sure why that is.

thakur-nandan commented 3 years ago

But is it possible to use already pre-trained cross-encoders at step 2? For example: sentence-transformers/ce-roberta-large-stsb

Hi @langineer, yes, you can go ahead and use a fine-tuned cross-encoder for step 2. Just make sure you are labeling only the silver pairs, i.e. combinations of sentences not in the original STSb dataset.
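
For illustration, here is a minimal sketch of that labeling step, assuming a fine-tuned STSb cross-encoder (the model name is simply the one mentioned above) and tiny placeholder lists instead of real data:

```python
from sentence_transformers import CrossEncoder

# An already fine-tuned STSb cross-encoder (skipping step 1 of the in-domain recipe).
cross_encoder = CrossEncoder('sentence-transformers/ce-roberta-large-stsb')

# Gold pairs: combinations that already exist in the labeled (gold) dataset.
gold_pairs = {("A man is playing a guitar.", "A person plays an instrument.")}

# Candidate pairs produced by the sampling step (BM25 / semantic search / etc.).
candidate_pairs = [
    ("A man is playing a guitar.", "A woman is cooking dinner."),
    ("A man is playing a guitar.", "A person plays an instrument."),  # already gold, will be skipped
]

# Label only the silver pairs, i.e. combinations not present in the gold dataset.
silver_pairs = [pair for pair in candidate_pairs if pair not in gold_pairs]
silver_scores = cross_encoder.predict(silver_pairs)  # continuous STSb-style similarity scores
print(list(zip(silver_pairs, silver_scores)))
```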

thakur-nandan commented 3 years ago

it seems I didn't fully get the idea of the augmentation

You can read section 3.1 from the original paper here - https://arxiv.org/abs/2010.08240

I was checking the main difference in training procedure between train_sts_indomain_semantic.py and train_sts_qqp_crossdomain.py.

In train_sts_indomain_semantic.py, we train both the cross-encoder and the bi-encoder on the same dataset, in the example the STSb dataset. In contrast, in train_sts_qqp_crossdomain.py, we train a cross-encoder on the source domain, i.e. STSb, then create binary labels for the QQP dataset (because QQP is a pairwise classification task), and finally train the bi-encoder on QQP.

I wanted to see if I can train on unlabeled data (similar/dissimilar pairs of sentences) with train_sts_indomain_semantic.py: create soft labels (any float from 0 to 1) and then train the bi-encoder with the CosineSimilarityLoss used in train_sts_indomain_semantic.py, rather than create binary labels and train with the MultipleNegativesRankingLoss used in train_sts_qqp_crossdomain.py. But after such training the evaluation score on the STS benchmark got worse (0.79).

If you get a bad score, two things are crucial to check:

  1. How was your cross-encoder trained?
  2. What is this unlabeled training data? Is it unused combinations from STSb as shown in the example?

I also tried to use 'distilroberta-base-paraphrase-v1' to create the embeddings at step 2 and as the bi-encoder to fit the data at step 3; then the score got better (0.84), but there is no improvement after the first epoch. I'm not sure why that is.

I am unsure for which example you mention this issue. If it is for train_sts_qqp_crossdomain.py, I could imagine distilroberta-base-paraphrase-v1 performing well for Quora, as both are similar tasks.

langineer commented 3 years ago

Hi @NThakur20, thanks for the detailed reply.

You can read section 3.1 from the original paper here - https://arxiv.org/abs/2010.08240

Yes, I read the paper, and this section was not that clear to me. I understood that there are supposed to be two parts:

  • random sampling, which will probably lead to selecting negative pairs, and
  • similar-pair sampling for the silver dataset (done with KDE, BM25, or SS). Semantic Search sampling (SS) is an example of similar-pair sampling, and that is what is done in the script 'train_sts_indomain_semantic.py'.

I am confused that I don't see random sampling being done in 'train_sts_indomain_semantic.py'; it seems that we augment the data mostly with positive pairs then. Maybe I am missing something, but I understood from the paper that both random sampling (which would yield negatives) and more selective sampling (which would yield similar pairs) are needed?

In train_sts_indomain_semantic.py, we train both the cross-encoder and the bi-encoder on the same dataset, in the example the STSb dataset. In contrast, in train_sts_qqp_crossdomain.py, we train a cross-encoder on the source domain, i.e. STSb, then create binary labels for the QQP dataset (because QQP is a pairwise classification task), and finally train the bi-encoder on QQP.

Yes, that is my understanding too: in train_sts_indomain_semantic.py the model is first trained on gold labels, then the data is augmented with silver labels and the model is trained on those too. I have unlabeled data that is similar to STS. Using MultipleNegativesRankingLoss (as in 'train_sts_qqp_crossdomain.py') would not be a good idea because there is similarity across pairs of sentences (a_i may be similar to b_i, and a_i may also be similar to b_j), and binary labeling is not a good idea either. So I tried to train on this unlabeled data as in 'train_sts_qqp_crossdomain.py' but with CosineSimilarityLoss and continuous labels 0...1 (similar to 'train_sts_indomain_semantic.py'). But I see that training on my unlabeled data (with 'bert-base-nli-stsb-mean-tokens') gives a score a bit worse than before fitting my data. So I am thinking: maybe the absence of gold labels could worsen the model?

I am unsure for which example you mention this issue. If it is for train_sts_qqp_crossdomain.py, I could imagine distilroberta-base-paraphrase-v1 performing well for Quora, as both are similar tasks.

When using distilroberta-base-paraphrase-v1 (in the training described above), the score improves after the first epoch (compared to 'bert-base-nli-stsb-mean-tokens') and then gradually decreases with each subsequent epoch (I tried 8 epochs), giving the impression of overfitting. With 'bert-base-nli-stsb-mean-tokens', on the other hand, the score is worse at first but gradually increases with every epoch. Maybe distilroberta-base-paraphrase-v1 is trained on a task less similar to what is done in train_sts_indomain_semantic.py, but I'd guess that if it were unsuitable it would rather yield a bad score from the very beginning of training?

If you get a bad score, two things are crucial to check.

How was your cross-encoder trained? - I didn't change the cross-encoder part and took it as it is in 'train_sts_indomain_semantic.py'.

What is this unlabeled training data? Is it unused combinations from STSb as shown in the example? - This unlabeled training data is also unused combinations, but not from STSb; it is part of my own data, which is unlabeled. (I don't have gold-labeled data, only unlabeled data.)

thakur-nandan commented 3 years ago

Yes, I read the paper, and this section was not that clear to me. I understood that there are supposed to be two parts:

  • random sampling, which will probably lead to selecting negative pairs, and
  • similar-pair sampling for the silver dataset (done with KDE, BM25, or SS). Semantic Search sampling (SS) is an example of similar-pair sampling, and that is what is done in the script 'train_sts_indomain_semantic.py'.

I am confused that I don't see random sampling being done in 'train_sts_indomain_semantic.py'; it seems that we augment the data mostly with positive pairs then. Maybe I am missing something, but I understood from the paper that both random sampling (which would yield negatives) and more selective sampling (which would yield similar pairs) are needed?

If we add all randomly sampled pairs, not all pairs contribute well; two random sentences would not help the model learn semantics. Sampling combinations using BM25, SS, or KDE helps surface the best examples out of all possible combinations. In train_sts_indomain_semantic.py we implement a Semantic Search sampling strategy: we combine pairs which do not occur in the original dataset. These pairs are not necessarily positive pairs; the labeling is done by the cross-encoder, so they can turn out either positive or negative.
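
As a rough sketch (not the exact code from train_sts_indomain_semantic.py; the bi-encoder name is just one mentioned in this thread), the Semantic Search sampling step looks roughly like this:

```python
from sentence_transformers import SentenceTransformer, util

# Bi-encoder used only for candidate mining; any reasonable sentence encoder works here.
bi_encoder = SentenceTransformer('xlm-r-bert-base-nli-stsb-mean-tokens')

sentences = [
    "A man is playing a guitar.",
    "A person plays an instrument.",
    "A woman is cooking dinner.",
    "Someone prepares a meal.",
]

embeddings = bi_encoder.encode(sentences, convert_to_tensor=True)

# For every sentence, retrieve its top-k most similar sentences as silver-pair candidates.
top_k = 2
hits = util.semantic_search(embeddings, embeddings, top_k=top_k + 1)  # +1: each sentence also matches itself

silver_candidates = []
for i, sentence_hits in enumerate(hits):
    for hit in sentence_hits:
        j = hit['corpus_id']
        if i != j:  # skip the trivial self-match
            silver_candidates.append((sentences[i], sentences[j]))

# These candidates are then scored by the cross-encoder, which may label them high or low.
print(silver_candidates)
```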

thakur-nandan commented 3 years ago

Yes, that is my understanding too: in train_sts_indomain_semantic.py the model is first trained on gold labels, then the data is augmented with silver labels and the model is trained on those too. I have unlabeled data that is similar to STS. Using MultipleNegativesRankingLoss (as in 'train_sts_qqp_crossdomain.py') would not be a good idea because there is similarity across pairs of sentences (a_i may be similar to b_i, and a_i may also be similar to b_j), and binary labeling is not a good idea either. So I tried to train on this unlabeled data as in 'train_sts_qqp_crossdomain.py' but with CosineSimilarityLoss and continuous labels 0...1 (similar to 'train_sts_indomain_semantic.py'). But I see that training on my unlabeled data (with 'bert-base-nli-stsb-mean-tokens') gives a score a bit worse than before fitting my data. So I am thinking: maybe the absence of gold labels could worsen the model?

The in-domain examples are not applicable in your case, as I understand you don't have any gold-labeled data. So you will have to use the cross-domain example mentioned here: train_sts_qqp_crossdomain.py.

Now, you could probably try sentence-transformers/ce-roberta-large-stsb and use this model to label your unlabeled dataset (don't convert to binary scores). The cross-encoder will assign a continuous score between 0 and 1. After labeling with this cross-encoder, train a bi-encoder from scratch on your labeled dataset.
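
A condensed sketch of that recipe (model names are only those discussed in this thread, and the hyperparameters are placeholders, not recommendations):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, CrossEncoder, InputExample, losses

# Your unlabeled sentence pairs (placeholders here).
unlabeled_pairs = [
    ("A man is playing a guitar.", "A person plays an instrument."),
    ("A man is playing a guitar.", "A woman is cooking dinner."),
]

# Label the pairs with the fine-tuned cross-encoder; keep the continuous 0-1 scores.
cross_encoder = CrossEncoder('sentence-transformers/ce-roberta-large-stsb')
silver_scores = cross_encoder.predict(unlabeled_pairs)

silver_samples = [
    InputExample(texts=[a, b], label=float(score))
    for (a, b), score in zip(unlabeled_pairs, silver_scores)
]

# Train a bi-encoder on the silver-labeled data with CosineSimilarityLoss.
bi_encoder = SentenceTransformer('xlm-r-bert-base-nli-stsb-mean-tokens')  # or a model built from scratch
train_dataloader = DataLoader(silver_samples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model=bi_encoder)

bi_encoder.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```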

langineer commented 3 years ago

Hi @NThakur20, thanks for the answer,

Now, you could probably try sentence-transformers/ce-roberta-large-stsb and use this model to label your unlabeled dataset (don't convert to binary scores). The cross-encoder will assign a continuous score between 0 and 1.

Yes, I use the pre-trained sentence-transformers/ce-roberta-large-stsb to label the data.

But before that I encode the sentences with the pre-trained model xlm-r-bert-base-nli-stsb-mean-tokens to choose the top-k pairs (originally, the dataset is not paired). In train_sts_qqp_crossdomain.py you don't do that step of encoding sentences before feeding them into the cross-encoder, but I guess it shouldn't ruin the idea of the further training?

train a bi-encoder from scratch on your labeled dataset.

I'm not getting the logic of why I should train the bi-encoder from scratch on my newly labeled dataset. Is it possible to use a pre-trained model to train on this just-labeled dataset (at the end of step 3), for example the same model I used above to encode the sentences (xlm-r-bert-base-nli-stsb-mean-tokens)?

thakur-nandan commented 3 years ago

Sorry for the delayed reply @langineer,

For our experiments we take sentence-pair datasets (like STS or QQP) containing gold-labeled sentence pairs. If you utilize another model for pairing, these are not gold pairs but rather silver pairs, i.e. the pairs returned depend on the xlm-r-bert-base-nli-stsb-mean-tokens model.

You can fine-tune a bi-encoder from scratch or use a pre-trained model (e.g. xlm-r-bert-base-nli-stsb-mean-tokens) and further fine-tune it on your dataset. During our experiments, we found the bi-encoder trained from scratch performed better; however, you can try both and choose the model with better performance.
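
For illustration, the two options could look roughly like this (a sketch; 'xlm-roberta-base' is just an assumed base checkpoint for the from-scratch variant):

```python
from sentence_transformers import SentenceTransformer, models

# Option A: build a bi-encoder "from scratch", i.e. from a plain transformer checkpoint
# with no prior sentence-embedding training, plus a mean-pooling layer on top.
word_embedding_model = models.Transformer('xlm-roberta-base', max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
bi_encoder_from_scratch = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Option B: start from an already trained sentence-embedding model and keep fine-tuning it.
bi_encoder_pretrained = SentenceTransformer('xlm-r-bert-base-nli-stsb-mean-tokens')

# Either model is then fine-tuned on the silver-labeled dataset as in the previous sketch.
```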