UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Training custom data using triplet loss #97

Open kaminocode opened 4 years ago

kaminocode commented 4 years ago

I have a sequence classification dataset that I want to use to train sentence embeddings with triplet loss. How should I restructure the dataset to make it compatible with the codebase? Also, how should I choose the triplets, i.e. the positive and negative examples?

nreimers commented 4 years ago

Hi, your train / dev file should look like this, with one triplet per line:

    anchor1 positive1 negative1
    anchor2 positive2 negative2

Separated by tabs (\t).
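
For a classification dataset, one simple way to get such a file is to pick, for each sentence, a positive with the same label and a negative with a different label. The sketch below assumes the data is a list of (sentence, label) pairs and uses uniformly random sampling; the function names `build_triplets` and `to_tsv` are made up for illustration and are not part of the framework. Random sampling is only a baseline, not a recommendation.

```python
import random
from collections import defaultdict


def build_triplets(examples, seed=0):
    """Turn (sentence, label) pairs into (anchor, positive, negative) triplets.

    Positive: a different sentence with the same label as the anchor.
    Negative: a sentence with any other label.
    Sampling is uniformly random (an assumption, not a rule).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sentence, label in examples:
        by_label[label].append(sentence)

    triplets = []
    for anchor, label in examples:
        positives = [s for s in by_label[label] if s != anchor]
        negatives = [
            s
            for other, group in by_label.items()
            if other != label
            for s in group
        ]
        if not positives or not negatives:
            continue  # no valid triplet can be formed for this anchor
        triplets.append((anchor, rng.choice(positives), rng.choice(negatives)))
    return triplets


def to_tsv(triplets):
    """Render triplets as tab-separated lines, one triplet per line."""

    def clean(text):
        # Tabs and newlines inside a sentence would break the file layout.
        return text.replace("\t", " ").replace("\n", " ")

    return "".join("\t".join(clean(field) for field in t) + "\n" for t in triplets)


# Hypothetical toy data, only to show the shape of the input.
examples = [
    ("The movie was great", "pos"),
    ("I loved this film", "pos"),
    ("Terrible acting", "neg"),
    ("I hated the plot", "neg"),
]
triplets = build_triplets(examples)
tsv_text = to_tsv(triplets)
print(tsv_text)
```

Writing `tsv_text` to a file gives the anchor / positive / negative layout described above.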

The difficult part is choosing the positive / negative examples. There is a lot of literature on this, and it shows that the selection of negative examples can matter a great deal for performance.

This paper can be of interest: https://arxiv.org/pdf/1703.07737.pdf

The described batch hard strategy is also implemented in this framework.
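
To give an idea of what batch hard means: for every anchor in a batch, take the positive that is farthest away and the negative that is closest, and compute the loss on that one triplet. The sketch below is a plain-Python illustration of that idea, not the framework's implementation. Euclidean distance, the hinge with a fixed margin, and averaging over anchors are assumptions here, and `batch_hard_triplet_loss` is a hypothetical name.

```python
import math


def batch_hard_triplet_loss(embeddings, labels, margin=1.0):
    """Mean batch-hard triplet loss over one batch.

    For each anchor: hardest positive = farthest sample with the same label,
    hardest negative = closest sample with a different label.
    Anchors without a positive or without a negative are skipped.
    """
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        pos_dists = [
            math.dist(anchor, other)
            for j, (other, other_label) in enumerate(zip(embeddings, labels))
            if j != i and other_label == label
        ]
        neg_dists = [
            math.dist(anchor, other)
            for other, other_label in zip(embeddings, labels)
            if other_label != label
        ]
        if not pos_dists or not neg_dists:
            continue
        losses.append(max(max(pos_dists) - min(neg_dists) + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0


# Toy 2-D "embeddings": two well separated classes.
embeddings = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
labels = [0, 0, 1, 1]

loss_small_margin = batch_hard_triplet_loss(embeddings, labels, margin=1.0)
loss_large_margin = batch_hard_triplet_loss(embeddings, labels, margin=5.0)
print(loss_small_margin, loss_large_margin)
```

With this approach the input is labelled sentences rather than pre-built triplets, since the triplets are formed inside each batch.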

How to choose the positive / negative example depends on your task, so there is no general rule for that. Triplet loss tries to bring anchor and positive close together, while maximizing the distance between anchor and negative.
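
As a small worked example of that objective for a single triplet, here is a sketch assuming Euclidean distance and a margin of 1.0 (both are illustrative choices, and `triplet_loss` is a hypothetical helper, not the framework's API). The loss is zero once the negative is farther from the anchor than the positive by at least the margin.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(d(a, p) - d(a, n) + margin, 0)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


anchor = (0.0, 0.0)
positive = (0.0, 1.0)       # distance 1.0 from the anchor
easy_negative = (3.0, 0.0)  # distance 3.0 from the anchor
hard_negative = (0.5, 0.0)  # distance 0.5, closer than the positive

easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)
print(easy_loss)  # 0.0, the easy negative is already far enough away
print(hard_loss)  # 1.5, the hard negative still produces a training signal
```

This also shows why negative selection matters: triplets with easy negatives contribute nothing to the loss.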

Best,
Nils Reimers