centre-for-humanities-computing / dfm-sentence-transformers

Code for curating data and training sentence transformers for the Danish Foundation Models project.
MIT License

What do we do with only positive sentence pairs? #3

Closed by x-tabdeveloping 10 months ago

x-tabdeveloping commented 11 months ago

@KennethEnevoldsen gave me ContrastiveTensionLoss as an example of how one could do in-batch negative sampling, but as you can see in this example, Contrastive Tension loss with in-batch negatives is used with an unsupervised training objective, so it is probably not what we're looking for.

I think MultipleNegativesRankingLoss is what we're looking for. As per the docs:

> This loss expects as input a batch consisting of sentence pairs (a_1, p_1), (a_2, p_2), …, (a_n, p_n) where we assume that (a_i, p_i) are a positive pair and (a_i, p_j) for i != j a negative pair.
>
> For each a_i, it uses all other p_j as negative samples, i.e., for a_i we have 1 positive example (p_i) and n-1 negative examples (p_j). It then minimizes the negative log-likelihood for softmax-normalized scores.
>
> This loss function works great to train embeddings for retrieval setups where you have positive pairs (e.g. (query, relevant_doc)), as it will sample in each batch n-1 negative docs randomly.

This essentially does the same thing as the ContrastiveParallel task that I wrote, but with in-batch negative examples, and the number of negatives per anchor is fixed by the batch size.
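For reference, a minimal sketch of what this would look like with the sentence-transformers training API; the base checkpoint and the example pairs below are placeholders, not the project's actual model or data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder checkpoint; the project would use its own Danish base model.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Positive pairs only: (anchor, positive). No explicit negatives are needed.
train_examples = [
    InputExample(texts=["anchor sentence 1", "positive sentence 1"]),
    InputExample(texts=["anchor sentence 2", "positive sentence 2"]),
]

# In-batch negatives: for each anchor, every other positive in the batch acts
# as a negative, so each anchor gets batch_size - 1 negatives.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```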

MultipleNegativesSymmetricRankingLoss could also work quite well, as per the documentation:

> This loss is an adaptation of MultipleNegativesRankingLoss. MultipleNegativesRankingLoss computes the following loss: for a given anchor and a list of candidates, find the positive candidate. In MultipleNegativesSymmetricRankingLoss, we add another loss term: given the positive and a list of all anchors, find the correct (matching) anchor. For the example of question answering: you have (question, answer) pairs. MultipleNegativesRankingLoss just computes the loss to find the answer for a given question. MultipleNegativesSymmetricRankingLoss additionally computes the loss to find the question for a given answer.
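If we went with the symmetric variant, it should be a drop-in swap of the loss object (a sketch, reusing the placeholder positive-pair dataloader from above):

```python
# Same positive-pair dataloader as above; only the loss changes, adding the
# "find the matching anchor for a given positive" term to the objective.
train_loss = losses.MultipleNegativesSymmetricRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```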

I also quite like MegaBatchMarginLoss (see the sketch after the quote below):

> Given a large batch (like 500 or more examples) of (anchor_i, positive_i) pairs, find for each pair in the batch the hardest negative, i.e. find j != i such that cos_sim(anchor_i, positive_j) is maximal. Then create from this a triplet (anchor_i, positive_i, positive_j) where positive_j serves as the negative for this triplet.
>
> Then train as with the triplet loss.

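A sketch of that, again with the placeholder pairs from above; the practically relevant difference is the large batch size it expects:

```python
# MegaBatchMarginLoss mines the hardest in-batch negative for each pair,
# which is why the docs recommend large batches (500 or more examples).
mega_batch_dataloader = DataLoader(train_examples, shuffle=True, batch_size=512)
train_loss = losses.MegaBatchMarginLoss(model)
model.fit(train_objectives=[(mega_batch_dataloader, train_loss)], epochs=1, warmup_steps=100)
```
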
KennethEnevoldsen commented 11 months ago

Sounds like there might be a reason to use a different loss, but MultipleNegativesRankingLoss is probably a good baseline to go with, and then we can do manipulations on that afterwards.

x-tabdeveloping commented 10 months ago

I consider this done.