UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
15.05k stars 2.45k forks

Question about training with BatchHardTripletLoss #636

Open tide90 opened 3 years ago

tide90 commented 3 years ago

Maybe this is a naive question (as I am new to PyTorch).

When training as in the example shown here (with the above-mentioned loss):


from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample, losses

train_batch_size = 16
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_examples = [
    InputExample(texts=['Sentence from class 0'], label=0),
    InputExample(texts=['Another sentence from class 0'], label=0),
    InputExample(texts=['Sentence from class 1'], label=1),
    InputExample(texts=['Sentence from class 2'], label=2),
]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
train_loss = losses.BatchSemiHardTripletLoss(model=model)

how is a siamese model trained when I have two inputs? Because you are using a SentenceTransformer (which maps a single input to an output). Also, in your bi-encoder example you build a sentence transformer from scratch. I just wonder how training in a siamese manner happens?

In my understanding, SentenceTransformer is a siamese bi-encoder (like in your paper).

Otherwise, in your Quora example: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/quora_duplicate_questions/training_multi-task-learning.py

also a SentenceTransformer model is trained and gets the 2 inputs for sentence pairs. I wonder where and when the model "knows" how to fit depending on the number of inputs? I feel I am missing something. When is a "siamese" model trained, and when a "single" model with 1 input?

nreimers commented 3 years ago

Hi @tide90 Training always involves the computation of 2 or more embeddings and comparing them (for example, with cosine similarity).

BatchSemiHardTripletLoss is actually a bit more complicated: you have a batch with e.g. 64 sentences and labels. The loss then finds, within the batch, samples with the same label, which form (anchor, positive) pairs. Sentences with a different label serve as negatives, so you get triplets (anchor, positive, negative).

This is then optimized so that (anchor, positive) are close and (anchor, negative) are far apart in vector space.
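As a rough illustration of this batch-hard mining idea, here is a minimal NumPy sketch. This is not the library's implementation; the function name, the 0.5 margin, and the Euclidean metric are assumptions for the example.

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.5):
    # Pairwise Euclidean distances between all embeddings in the batch
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)

    same_label = labels[:, None] == labels[None, :]
    self_mask = np.eye(len(labels), dtype=bool)

    # Hardest positive per anchor: farthest sample with the same label
    hardest_pos = np.where(same_label & ~self_mask, dist, -np.inf).max(axis=1)
    # Hardest negative per anchor: closest sample with a different label
    hardest_neg = np.where(~same_label, dist, np.inf).min(axis=1)

    # Push positives closer than negatives by at least the margin
    return np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean()
```

With well-separated classes the loss goes to zero; with overlapping classes it stays positive.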

tide90 commented 3 years ago

Training always involves the computation of 2 or more embeddings and comparing them (for example, with cosine similarity).

Could you be more specific? I only know the case of training a siamese network (shared weights), like with TensorFlow. I do not see how training happens if you compute 2 embeddings but feed just one input?

I understand the BatchHard loss better now, but I do not know what the underlying model architecture is. This loss expects a single input and computes different pairs per batch on the fly. That is understandable.

Although you obviously have the same "model training" usage for different types of inputs, as with Quora vs. the bi-encoder (using 2 vs. 1 input). So in TensorFlow you would assume different model architectures.

EDIT:

Originally I always had siamese networks with contrastive loss in mind. And you draw this picture in your paper as well. So maybe my mental image of a siamese network is wrong?

nreimers commented 3 years ago

Yes, the image does not match for BatchHard loss.

For a nice write-up on triplet loss and batch hard loss: https://omoindrot.github.io/triplet-loss

tide90 commented 3 years ago

Sorry, maybe it is too trivial, but I feel you don't address my points above. I completely understand the loss (and know this blog). My issue, as should be clear from above, is the models used and their training. I described different use cases (like Quora) all having the same "training style", which confuses me. Which models are used and how are they trained? In the Quora case you have 2 inputs (the pairs), which should be processed by a siamese network. For the bi-encoder you use 1 input. All have the same style and use "SentenceTransformer", which feels like a model with a single input?! Where does the siamese model to train for Quora come from, then?

datistiquo commented 3 years ago

@nreimers

Is training with BatchHardTripletLoss as simple as just using the example below from the docs (also shown in the first post above)?

I just need to feed 1 sentence and its corresponding class label? Since it automatically computes the triplets within each batch in a more intelligent way, I do not have to take care to group them on my own like with MultipleNegativesRankingLoss? https://github.com/UKPLab/sentence-transformers/issues/641

nreimers commented 3 years ago

An example is here: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_batch_hard_trec_continue_training.py

You should ensure that each batch has, for every included class label, at least two examples. This can be realized, again, via a PyTorch data sampler that constructs your batches with the needed properties.

Such a sampler is implemented here: https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/datasets/sampler/LabelSampler.py

datistiquo commented 3 years ago

@nreimers Thanks! I think this is exactly what was recently asked in other issues about MultipleNegativesRankingLoss. And this sampler is a good starting point for this loss? And this code example has been out there since Aug 04.

In the BatchHard example: why do you use a triplet generator there? It seems to be only for the dev/test cases. But training happens without given triplets?

This goes in the direction of the issue author's question. I also feel a little confused about this, because

starting from model = SentenceTransformer(model_name), most examples always look similar, but the training, and thus the underlying model architecture, should actually be different? The BatchHard losses use a single model, whereas models using contrastive or triplet loss use a siamese/triplet network?

nreimers commented 3 years ago

@datistiquo All the losses use a single network/model: the inputs are passed through the same network (same object) in all cases.

The only differences are in how the losses are computed. How the loss is computed depends on your available training data and its properties. So based on what labeled data you have, you have to choose the right loss.

BatchHard generates the triplets online (as described in the above blog post). So no need to generate triplets yourself, the loss will look into the batch and create all possible triplets from it.

For evaluation, however, we want to see how well it works for specific triplets. So we create some fixed triplets and evaluate the model on it.
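The "all possible triplets within a batch" idea can be sketched in plain Python. This is a hypothetical helper for illustration, not the library's code:

```python
from itertools import permutations

def valid_triplet_indices(labels):
    """All (anchor, positive, negative) index triples where anchor and
    positive share a label and the negative has a different label."""
    triplets = []
    for a, p in permutations(range(len(labels)), 2):
        if labels[a] != labels[p]:
            continue
        for n, lab in enumerate(labels):
            if lab != labels[a]:
                triplets.append((a, p, n))
    return triplets
```

A batch-all loss would average the triplet losses over these index triples; the batch-hard variant instead keeps only the hardest positive and hardest negative per anchor.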

datistiquo commented 3 years ago

Hey @nreimers, many thanks! I already understand that, in TF fashion, you also have a model with 1 input for those batch triplet losses. But as stated multiple times, for losses like contrastive you would have a model which requires 2 or 3 inputs (via Input layers in TF fashion). That is a little bit confusing when using the "same" training syntax.

In the BatchHard example you use SentenceLabelDataset but also showed this LabelSampler. Are both the same? I thought the triplets were computed within the loss function and not via the dataset?

Also, I assume that if I comment out the loss used in this example and use one of the other (commented-out) losses below, no changes would be needed?

datistiquo commented 3 years ago

@nreimers

In SentenceLabelDataset you have, in __getitem__:

return [anchor, positive, negative], self.grouped_labels[item]

So, I thought the triplets were computed within the loss function? Why do you compute these triplets there?

tide90 commented 3 years ago

@nreimers I want to get back to the issue with training, as also mentioned by @datistiquo.

In https://github.com/UKPLab/sentence-transformers/blob/24467c7af488f3672acee1bc5ff2efc509dc4e6e/sentence_transformers/SentenceTransformer.py#L451

I cannot figure out why and how the model training depends on the kind of training data. When using different losses I would have a kind of siamese or triplet pass, or just a single network pass (for the batch triplet losses). Where is this happening?

I think this is also mentioned in the second-to-last post: the input depends on the various losses, but the "syntax" for training is the same.

I would also love to know:

In the BatchHard example you use SentenceLabelDataset but also showed this LabelSampler. Are both the same? I thought the triplets were computed within the loss function and not via the dataset?

Also, I assume that if I comment out the loss used in this example and use one of the other (commented-out) losses below, no changes would be needed?

In SentenceLabelDataset you have, in __getitem__:

return [anchor, positive, negative], self.grouped_labels[item]

So, I thought the triplets were computed within the loss function? Why do you compute these triplets there?

Is it correct that the dataloader uses the SentenceLabelDataset object to generate the batches on the fly while training? So is this a preselection to have only some triplets in each batch before calculating the actual triplets?

Thanks!

nreimers commented 3 years ago

@tide90 The relevant part is in the loss functions. They define how the loss is computed and which sentences are compared.

The SentenceLabelDataset and the LabelSampler are rather old, outdated and complicated scripts.

In the BatchHardExample, provide_positive and provide_negative are both False. In that case, SentenceLabelDataset just returns the InputExample without any changes. So it is identical to a list of InputExample and returning an element at a specific position.

But the SentenceLabelDataset sorts the elements so that examples with the same label are right next to each other, e.g: [0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, ...]

The LabelSampler then ensures that each batch has multiple samples with the same label. If an example with e.g. label 1 is picked, it will pick other examples also with label 1.

The implementation is rather complicated and stems from a very old version of sentence-transformers. A more efficient implementation would be possible (especially with the upcoming version 0.4.1).

For more details on datasets, dataloaders and data samplers, I highly recommend having a look at this article that explains the fundamentals: https://pytorch.org/docs/stable/data.html

tide90 commented 3 years ago

I understand a Sampler as something which takes the batch (from the dataloader) and yields 1 example sequentially, with some sort of operation (like randomly picking 1 example).

I got the impression from the SentenceTransformer framework that everything is implemented (otherwise all these losses wouldn't be there). So this framework needs to be adapted with my own dataloaders etc. But I would assume standard examples for each loss (dataset, dataloaders) exist so one can use them properly. A basic example like the one above is good, but it now makes it difficult to apply the other batch hard losses because of switching provide_negative or provide_positive (a comment in the code about when to do what would be good). A comment or hint especially for MultipleNegativesRankingLoss should be there.

I still do not understand why you have this:

In SentenceLabelDataset you have, in __getitem__: return [anchor, positive, negative], self.grouped_labels[item]

The loss takes as input just the examples, and calculates the triplets.

In the BatchHardExample, provide_positive and provide_negative are both False. In that case, SentenceLabelDataset just returns the InputExample without any changes. So it is identical to a list of InputExample and returning an element at a specific position.

But the SentenceLabelDataset sorts the elements so that examples with the same label are right next to each other, e.g: [0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, ...]

Did you interchange the words SentenceLabelDataset and LabelSampler?

The LabelSampler then ensures that each batch has multiple samples with the same label. If an example with e.g. label 1 is picked, it will pick other examples also with label 1.

So, using this loss I need to use this Sampler? But this Sampler is not used in the dataset? Very confusing.

tide90 commented 3 years ago

Actually, I thought that when using one of these batch triplet losses I would not need any advanced dataloader/dataset, because the loss calculates everything that is needed. The only thing I need is maybe to ensure at least 2 positive examples in each batch?

nreimers commented 3 years ago

@tide90 Have a look at the released 0.4.1 version: https://github.com/UKPLab/sentence-transformers/releases/tag/v0.4.1

The SentenceLabelDataset has been substantially reworked: https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/datasets/SentenceLabelDataset.py

And the example was also updated: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_batch_hard_trec.py

If you use that dataset, it is ensured that each batch has at least 2 examples from each label class.

There is no longer a need for a sampler, and it can quite easily be done using the Dataset class from torch.

tide90 commented 3 years ago

@nreimers Thank you. I am testing it right now.

So, you train all these batch hard losses with single examples and labels, like normal text classification. Can I use such a trained model to do semantic matching via the sentence encodings and cosine similarity (see below)?

I have the IR evaluator (like here), where I compute the similarity between two sentences. If I train the loss with the Euclidean distance, should I also use the Euclidean distance with the sentence embeddings (for the similarity between two sentences)? I guess that is somehow important, right? In your IR evaluator you use cosine similarity by default. So I should use the same distance metric for the loss as for my evaluation?

So you should know this before combining the batch hard losses with the IR evaluator.

Also, the question is whether cosine similarity is even suitable if you trained via cosine distance. The loss and its distance function shape the embeddings during training, so you should use the same distance for evaluation for the embeddings to yield a meaningful similarity. Do you understand what I mean?

tide90 commented 3 years ago

@nreimers I would assume that you should use the same distance metric as in your loss, since the vectors are trained via this metric. So using this metric in your evaluation/tests is better.

What do you think about evaluating the embeddings using cosine similarity even though the loss uses cosine distance? I think the cosine similarity score should capture the same information as the cosine distance, as it is just a translation in vector space.

tide90 commented 3 years ago

@nreimers Was my question clear, or is it too trivial? :-) I think intuitively you need the same metric for judging the semantics of an embedding as the one used during training in the loss...

nreimers commented 3 years ago

Cosine distance and cosine similarity are basically the same (cosine distance = 1 - cosine similarity).

If the vectors have limited length, e.g. all vector norms are below a threshold, then cosine similarity and Euclidean distance are more or less the same (up to some factors). For unit-length vectors the relation is exact.
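Both claims can be checked numerically with a small NumPy demo (the example vectors are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 3.0])

cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
cos_dist = 1.0 - cos_sim  # same information, just reversed ordering

# For unit-length vectors, squared Euclidean distance is a monotone
# function of cosine similarity: ||a_n - b_n||^2 = 2 - 2 * cos_sim
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
eucl_sq = np.sum((a_n - b_n) ** 2)
```

So for normalized embeddings, ranking by cosine similarity and ranking by Euclidean distance give the same order.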

terry07 commented 3 years ago

Dear all,

I have just started using the triplet losses for my research. However, since I have not studied much of the PyTorch, I have the next two queries:

i) How can I see and store the triplets that are formed during the testing phase? For example, in the training_batch_hard_trec.py script, what should I add there to save the test set based on triplets?

ii) In the case that not all triplets are examined, is there any standard evaluation protocol for assessing the accuracy scores that you mention in the script?

Thanks for your time.

nreimers commented 3 years ago

@terry07 1) You can use any format you like: csv, tsv, json, pickle

2) The example constructs a sample of triplets. Examining and evaluating all possible triplets in that script is not really possible, as there are far too many combinations. So the standard way is to test some triplets and see how the model performs on these.

terry07 commented 3 years ago

@nreimers thanks for the answer.

Could you help me with the command for saving the test set? Moreover, can the trained model provide scores for each examined distance? For example, for any triplet, does the decision come as a boolean answer, or is there a distance measurement between the anchor and the positive/negative sample before the corresponding comparison takes place?

Thanks again for your time.

nreimers commented 3 years ago

https://stackoverflow.com/questions/11218477/how-can-i-use-pickle-to-save-a-dict

This also works with any other data type in Python.
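For example, dumping and reloading a list of evaluation triplets with pickle might look like this (the file name and example data are made up):

```python
import os
import pickle
import tempfile

test_triplets = [
    ("How old are you?", "What is your age?", "Where do you live?"),
]

# Save the triplets alongside other run artifacts
path = os.path.join(tempfile.gettempdir(), "test_triplets.pkl")
with open(path, "wb") as f:
    pickle.dump(test_triplets, f)

# Reload them later for evaluation
with open(path, "rb") as f:
    loaded_triplets = pickle.load(f)
```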