Open kaminocode opened 4 years ago
Hi, your train / dev file should look like this:
anchor1 positive1 negative1
anchor2 positive2 negative2
One triplet per line, with the three columns separated by tabs (\t).
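To make the restructuring concrete, here is a minimal sketch of turning a classification dataset into that layout. It assumes the data is available as (sentence, label) pairs and uses the simplest possible sampling: a random sentence with the same label as the positive and a random sentence with a different label as the negative. The function names `build_triplets`, `triplet_to_line`, and `write_triplets` are made up for this illustration and are not part of the framework.

```python
import random
from collections import defaultdict


def build_triplets(examples, seed=0):
    """Turn (sentence, label) pairs into (anchor, positive, negative) tuples.

    Assumption: random sampling, with the positive sharing the anchor's label
    and the negative coming from any other label.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sentence, label in examples:
        by_label[label].append(sentence)

    triplets = []
    for anchor, label in examples:
        positives = [s for s in by_label[label] if s != anchor]
        negatives = [
            s
            for other, group in by_label.items()
            if other != label
            for s in group
        ]
        if not positives or not negatives:
            continue  # skip anchors that have no usable positive or negative
        triplets.append((anchor, rng.choice(positives), rng.choice(negatives)))
    return triplets


def triplet_to_line(triplet):
    """Join one triplet with tabs, collapsing whitespace inside each sentence
    so stray tabs or newlines cannot break the column layout."""
    return "\t".join(" ".join(text.split()) for text in triplet)


def write_triplets(triplets, path):
    """Write one tab-separated triplet per line."""
    with open(path, "w", encoding="utf-8") as f:
        for triplet in triplets:
            f.write(triplet_to_line(triplet) + "\n")


# Toy dataset, invented for the example.
examples = [
    ("The movie was great", "pos"),
    ("I loved this film", "pos"),
    ("Terrible acting", "neg"),
    ("I hated the plot", "neg"),
]
triplets = build_triplets(examples)
```

Calling `write_triplets(triplets, "train.tsv")` would then produce a file in the format above. Random negatives are only a starting point; harder negatives usually need a task-specific strategy.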
The hard part is choosing the positive / negative examples. There is a lot of literature on this, and it reports that the choice of negative examples in particular can matter a great deal for performance.
This paper can be of interest: https://arxiv.org/pdf/1703.07737.pdf
The described batch hard strategy is also implemented in this framework.
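For intuition, the batch hard idea from that paper can be sketched in plain Python: for every anchor in a batch, take the farthest sample with the same label as the positive and the closest sample with a different label as the negative. This is an illustration only, not the framework's implementation; the Euclidean distance, the margin value, and averaging over anchors are assumptions made for the example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_hard_triplet_loss(embeddings, labels, margin=1.0):
    """Mean hinge loss over anchors, using the hardest positive and the
    hardest negative found inside the batch for each anchor."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        pos_dists = [
            euclidean(anchor, e)
            for j, e in enumerate(embeddings)
            if j != i and labels[j] == label
        ]
        neg_dists = [
            euclidean(anchor, e)
            for j, e in enumerate(embeddings)
            if labels[j] != label
        ]
        if not pos_dists or not neg_dists:
            continue  # anchor has no positive or no negative in this batch
        hardest_positive = max(pos_dists)  # farthest same-label sample
        hardest_negative = min(neg_dists)  # closest different-label sample
        losses.append(max(hardest_positive - hardest_negative + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0


# Toy 2-D "embeddings": two well separated classes.
embeddings = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
labels = [0, 0, 1, 1]
loss_small_margin = batch_hard_triplet_loss(embeddings, labels, margin=1.0)
loss_large_margin = batch_hard_triplet_loss(embeddings, labels, margin=5.0)
```

With this strategy the triplets are formed from the labels inside each batch, so it works on labeled sentences directly rather than on a file of pre-built triplets.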
How to choose the positive / negative example depends on your task, so there is no general rule for that. Triplet loss tries to bring anchor and positive close together, while pushing anchor and negative further apart.
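Written out, a common form of the triplet objective looks like the following. The Euclidean distance, the embedding function f, and the margin m are assumptions for illustration; the distance metric and margin actually used depend on the configuration.

```latex
\mathcal{L}(a, p, n) = \max\bigl( \lVert f(a) - f(p) \rVert_2 - \lVert f(a) - f(n) \rVert_2 + m,\; 0 \bigr)
```

The loss is zero once the negative is at least m farther from the anchor than the positive, so triplets that already satisfy the margin contribute no gradient. This is one reason the choice of negatives matters.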
Best,
Nils Reimers
I have a sequence classification dataset that I want to use to train sentence embeddings with triplet loss. How should I restructure the dataset to make it compatible with the codebase? Also, how should I choose the triplets, i.e. the positive and negative examples?