UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Pretraining objectives with MultiLabel? #1320

Open ddofer opened 2 years ago

ddofer commented 2 years ago

Are there any appropriate setups or losses in sentence-transformers for pretraining sentence embeddings when I have labels as targets? (I want to fine-tune the actual embeddings, not just a separate classifier layer, since I want to analyze the "supervised" embeddings as well. The data is fundamentally a single text column, but I can always duplicate part of it to get a two-sentence format if needed.) Thanks!

nreimers commented 2 years ago

Yes: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_batch_hard_trec.py
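
For reference, a minimal sketch of what that script does (batch-hard triplet training over labeled sentences); the model name and data below are placeholders, and this assumes the classic InputExample / model.fit API:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.datasets import SentenceLabelDataset

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model name

# One InputExample per sentence, with an integer class label.
train_examples = [
    InputExample(texts=["How do I reset my password?"], label=0),
    InputExample(texts=["Password reset instructions"], label=0),
    InputExample(texts=["What is your refund policy?"], label=1),
    InputExample(texts=["Can I get my money back?"], label=1),
]

# SentenceLabelDataset draws batches so that every sampled label appears
# at least `samples_per_label` times, which batch-hard losses require.
train_dataset = SentenceLabelDataset(train_examples, samples_per_label=2)
train_dataloader = DataLoader(train_dataset, batch_size=32, drop_last=True)

# Within a batch: same label = positive, different label = negative.
train_loss = losses.BatchHardTripletLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```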

ddofer commented 2 years ago

As written, that assumes multiclass, no? I.e., with multilabel data it would create duplicates per label, and in addition you'd need to loop over everything once per label and drop duplicate InputExamples by label?

nreimers commented 2 years ago

Works for any number of labels. It assumes you have at least 2 training examples for every label.
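
To make the multilabel case concrete, one hedged option is to expand each row into one InputExample per (sentence, label) pair and drop any label that ends up with fewer than 2 examples; `rows` and its fields below are made-up placeholders:

```python
from collections import Counter
from sentence_transformers import InputExample

# Hypothetical multilabel data: each row has one text and several labels.
rows = [
    {"text": "How do I reset my password?", "labels": [0, 3]},
    {"text": "Password reset instructions", "labels": [0]},
    {"text": "What is your refund policy?", "labels": [1, 3]},
]

# One InputExample per (sentence, label) pair.
examples = [
    InputExample(texts=[row["text"]], label=label)
    for row in rows
    for label in row["labels"]
]

# Batch-hard sampling needs >= 2 examples per label to form a positive
# pair, so drop labels that are too rare.
label_counts = Counter(ex.label for ex in examples)
examples = [ex for ex in examples if label_counts[ex.label] >= 2]
```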

ddofer commented 2 years ago

So one should use SentenceLabelDataset, with each sentence × label added as a separate example? (Won't that mean the same sentence can be a "negative" for itself? E.g., if it appears separately under multiple labels, it can be a negative for other labels/instances of itself?)

Thanks!

nreimers commented 2 years ago

Sentences with the same label are considered positives; sentences with different labels are considered negatives.
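
So with the per-(sentence, label) expansion sketched above, a sentence duplicated under several labels can indeed end up as a negative for its own copies. A quick hypothetical check for texts at risk, reusing the `examples` list from the earlier sketch:

```python
from collections import defaultdict

# Map each text to the set of labels it appears under.
labels_per_text = defaultdict(set)
for ex in examples:
    labels_per_text[ex.texts[0]].add(ex.label)

# Texts occurring under more than one label: in batch-hard triplet
# training, these copies can be treated as negatives of each other.
multi_label_texts = {t: ls for t, ls in labels_per_text.items() if len(ls) > 1}
print(multi_label_texts)
```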