UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Fine-tuning Cross-encoder with Triplet loss #2366

Open saeideh-sh opened 7 months ago

saeideh-sh commented 7 months ago

Hi, I am going to fine-tune 'cross-encoder/ms-marco-MiniLM-L-4-v2' for re-ranking the top documents in my retrieval setup. I have followed the instructions from the example in https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/cross-encoder/cross-encoder_reranking.py. However, after 3 epochs the benchmarking results of the fine-tuned cross-encoder are very poor. Now I want to use triplet loss instead of nn.BCEWithLogitsLoss(), which is the default one. I tried to implement it as follows:

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.evaluation import TripletEvaluator

for i, elem in enumerate(train_data):
    if type(elem[0]) != float and type(elem[1]) != float and type(elem[2]) != float:
        # elem[0]: query (anchor), elem[1]: positive_content, elem[2]: negative_content
        train_samples.append(InputExample(texts=[elem[0], elem[1], elem[2]]))
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-4-v2')
batch_size = 2
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=batch_size)
train_loss = losses.TripletLoss(model=model, triplet_margin=0.015, distance_metric=losses.TripletDistanceMetric.COSINE)
evaluator = TripletEvaluator.from_input_examples(dev_samples, name='cross-encoder-dev', main_distance_function=0)
steps = len(train_samples) / batch_size
evaluation_steps = int(steps * 0.10)
warmup_steps = int(steps * 0.10)
print('Number of steps:', steps)
print('Number of evaluation_steps:', evaluation_steps)
print('Number of warmup_steps:', warmup_steps)

model.fit(train_dataloader=train_dataloader, 
          loss_fct = train_loss,
          epochs=1, 
          evaluator=evaluator, 
          evaluation_steps=evaluation_steps, 
          optimizer_params={'lr': 2e-05}, 
          warmup_steps=warmup_steps, 
          scheduler='WarmupLinear',
          weight_decay=0.02)

However, I get an error. The traceback ends in torch/nn/modules/module.py at return forward_call(*input, **kwargs) with:

RuntimeError: The size of tensor a (2) must match the size of tensor b (512) at non-singleton dimension 1

It seems that either the loss_fct parameter is not compatible with triplet loss or I am feeding the loss incorrectly. I was wondering if you could help me understand how I can fine-tune with triplet loss, or whether it is possible at all. Also, for implementing the CrossEntropy loss, I already mined hard negatives to build my training and dev datasets. Do you have any suggestions that would help me improve the performance?

PhilipMay commented 7 months ago

Hi @saeideh-sh

I do not think this is possible or useful for a cross-encoder.

Let's talk about what a cross-encoder does: it reads two texts concatenated with a separator token. This concatenated input flows through a BERT model, and at the end there is a head. This head can do regression or classification.

This is very different from what an embedding model does. Yes, with an embedding model you can also compare texts, for example. But an embedding model does not concatenate the texts and read them both at the same time.

Instead, an embedding model encodes each sentence independently. Then something like cosine similarity can be used to compare the vectors.

For embedding models, TripletLoss can be used: it builds a batch and applies the triplet loss across the encoded texts in that batch. IMO this is not possible for a cross-encoder.
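To make the contrast concrete, here is a minimal sketch (the model names are only illustrative): the cross-encoder scores each (query, document) pair directly, while an embedding model encodes the texts separately and compares the vectors afterwards.

from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "how do I fine-tune a re-ranker?"
doc = "This page explains how to fine-tune cross-encoders."

# Cross-encoder: both texts pass through the transformer together and the
# head returns a single relevance score for the pair.
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-4-v2')
pair_score = cross_encoder.predict([(query, doc)])

# Embedding model (bi-encoder): each text is encoded independently and the
# vectors are compared afterwards, e.g. with cosine similarity.
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
doc_emb = bi_encoder.encode(doc, convert_to_tensor=True)
similarity = util.cos_sim(query_emb, doc_emb)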

saeideh-sh commented 7 months ago

Many thanks for the response, @PhilipMay!

Yes, that makes sense. Do you have any suggestions/tips on fine-tuning the cross-encoder? So far I have observed that it is very sensitive to the training data. My benchmarking of the fine-tuned 'cross-encoder/ms-marco-MiniLM-L-4-v2' does not give good results: the score for my top-1 retrieved documents is around 0.79, compared to 0.94 from the pre-trained model.

PhilipMay commented 7 months ago

@saeideh-sh did you have a look at the examples: https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/cross-encoder ?

Based on these examples, I am training cross-encoders to do a 4-class classification of two texts.

The accuracy on my test set is incredibly good: more than 99% on a balanced dataset.

saeideh-sh commented 7 months ago

@PhilipMay, thanks for sharing the details! Very interesting approach! In my case, I have (question, context) pairs. So would the following be correct, based on your approach?

(question, correct context), label = 0
(question, hard negative (same topic, different semantics)), label = 1
(question, context with different semantics), label = 2

PhilipMay commented 7 months ago

Yes. That seems to be good.
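For reference, a minimal sketch of such a 3-class setup, modeled on the training_nli.py example (the base model, label names, and placeholder samples are only illustrative; the ms-marco checkpoint and its 1-output head are discussed further below):

from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample
from sentence_transformers.cross_encoder.evaluation import CESoftmaxAccuracyEvaluator

label2int = {"correct": 0, "hard_negative": 1, "other_topic": 2}

# Placeholder data: replace with your mined (question, context) pairs.
train_samples = [
    InputExample(texts=["what is sbert?", "SBERT is a sentence embedding framework."], label=label2int["correct"]),
    InputExample(texts=["what is sbert?", "BERT is a masked language model."], label=label2int["hard_negative"]),
    InputExample(texts=["what is sbert?", "The Eiffel Tower is in Paris."], label=label2int["other_topic"]),
]
dev_samples = train_samples  # placeholder; use a real held-out split

# num_labels=3 gives the classification head three outputs; fit() then
# trains with cross-entropy over the integer labels.
model = CrossEncoder('distilroberta-base', num_labels=len(label2int))

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)
evaluator = CESoftmaxAccuracyEvaluator.from_input_examples(dev_samples, name='dev')

model.fit(train_dataloader=train_dataloader,
          evaluator=evaluator,
          epochs=1,
          warmup_steps=10)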

saeideh-sh commented 7 months ago

@PhilipMay, in your example you used num_labels=1 in model = CrossEncoder('distilroberta-base', num_labels=1) while you had 3 classes, i.e., label2int = {"contradiction": 0, "entailment": 1, "neutral": 2}.

However, in https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/cross-encoder/training_nli.py num_labels is the number of classes. What is the difference between them? Would it be OK to use num_labels=1 while I have those 3 classes?

In my case, I want to fine-tune 'cross-encoder/ms-marco-MiniLM-L-4-v2', and it does not accept multi-class, i.e., num_labels=3.

PhilipMay commented 7 months ago

If you have 3 classes, then num_labels should be set to 3.

It changes the number of outputs of the classification head.
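For instance (just a sketch; the classifier attribute shown assumes the underlying Hugging Face BertForSequenceClassification, and the base model is illustrative):

from sentence_transformers import CrossEncoder

# bert-base-uncased has no pretrained sequence-classification head, so any
# num_labels works and the head is freshly initialised.
ce = CrossEncoder('bert-base-uncased', num_labels=3)

# The head is a Linear layer whose output dimension equals num_labels.
print(ce.model.classifier)  # Linear(in_features=768, out_features=3, bias=True)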

saeideh-sh commented 7 months ago

@PhilipMay, I do not think this num_labels works for all cross-encoder models when it is set to a value greater than 1. I have tested 'cross-encoder/ms-marco-MiniLM-L-4-v2' with num_labels=3 and I get the following error:

RuntimeError: Error(s) in loading state_dict for BertForSequenceClassification:
size mismatch for classifier.weight: copying a param with shape torch.Size([1, 384]) from checkpoint, the shape in current model is torch.Size([3, 384]).
size mismatch for classifier.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([3]).
You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.

Any thoughts on that?

dsaks9 commented 1 month ago

Just pass this argument when you instantiate the CrossEncoder:

automodel_args={'ignore_mismatched_sizes': True}
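For example, something like this (a sketch; note that ignore_mismatched_sizes discards the checkpoint's 1-output head and initialises a fresh 3-output head, which then has to be fine-tuned):

from sentence_transformers import CrossEncoder

# automodel_args is forwarded to AutoModelForSequenceClassification.from_pretrained,
# so the size mismatch on the classification head is ignored.
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-4-v2',
                     num_labels=3,
                     automodel_args={'ignore_mismatched_sizes': True})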