Muennighoff opened this issue 2 years ago
Hi @Muennighoff,

Yeah, we tried that. Actually, what you describe is exactly `SBERT-base-nli-v2`, `SBERT-base-nli-stsb-v2` (zero-shot models) and `SBERT-supervised` (in-domain supervised) in Table 2. All of them were trained with MultipleNegativesRankingLoss, which is equivalent to SimCSE's supervised objective. The description can be found in Section 5.1 (Baseline Methods) of the paper. For the training code, one can refer to https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v2.py.
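For reference, a minimal sketch of that training setup with sentence-transformers (the linked `training_nli_v2.py` does more, e.g. a no-duplicates batching strategy and an STS dev evaluator; the base checkpoint, batch size and the example triplet below are placeholders, not the paper's exact settings):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Base encoder; training_nli_v2.py builds this from a Transformer + mean-pooling module.
model = SentenceTransformer("bert-base-uncased")

# One InputExample per (anchor, entailed positive, contradicting hard negative) triplet from NLI.
train_examples = [
    InputExample(texts=[
        "A man is playing a guitar.",     # premise (anchor)
        "A person plays an instrument.",  # entailment (positive)
        "A man is washing dishes.",       # contradiction (hard negative)
    ]),
    # ... the real script builds these triplets from the full SNLI/MultiNLI data
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

# In-batch negatives plus the provided hard negatives; this is the
# MultipleNegativesRankingLoss objective referred to above.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```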
Nice, thanks for this! I hadn't realized SimCSE's supervised objective was equivalent to `SBERT-base-nli-stsb-v2`'s objective. It also seems equivalent to Contrastive Multiview Coding (https://arxiv.org/pdf/1906.05849.pdf), except that they optionally take hard negatives from anywhere via a memory buffer, not just from the current batch.
So, ignoring that detail, all of the following optimize the same objective (sketched below):
- SBERT with MultipleNegativesRankingLoss
- SimCSE Supervised
- Multiview Contrastive Coding
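Concretely, a rough PyTorch sketch of that shared objective (the function name is mine for illustration; the default scale of 20 in MultipleNegativesRankingLoss corresponds to SimCSE's temperature of 0.05):

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(anchors, positives, hard_negatives=None, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives.

    anchors, positives, hard_negatives: [batch, dim] sentence embeddings.
    Each anchor's positive is the candidate at the same index; every other
    candidate in the batch (plus any hard negatives) acts as a negative.
    """
    candidates = positives if hard_negatives is None else torch.cat([positives, hard_negatives], dim=0)
    # pairwise cosine similarity between every anchor and every candidate: [batch, num_candidates]
    scores = F.cosine_similarity(anchors.unsqueeze(1), candidates.unsqueeze(0), dim=-1) * scale
    labels = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(scores, labels)
```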
Could you provide the training code for `SBERT-supervised`? (I.e., the training on USEB.)
Hi @Muennighoff,
The training data are organized as follows: `data-train/${dataset_name}/supervised/train.org` and `train.para` (each pair of parallel lines corresponds to one pair of gold paraphrases); `data-train/twitter/supervised/train.s1`, `train.s2` and `train.lbl` (each triple of parallel lines corresponds to sentence 1, sentence 2 and the gold label for these two sentences).

For hyper-parameters, I trained all these `SBERT-supervised` models for 10 epochs, with linear warmup over 0.1 * the total number of steps and early stopping on the dev score where possible. All other hyper-parameters are left at the defaults of `SentenceTransformer.fit`.
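In case it helps, here is a rough sketch of how such a run could look for one of the paraphrase-style datasets. This is not the author's actual script: `dataset_name`, the base checkpoint and the batch size are placeholders, the loss is the MultipleNegativesRankingLoss mentioned earlier, and the labeled Twitter data would need a different objective.

```python
import math
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

dataset_name = "askubuntu"  # hypothetical value for ${dataset_name}

# Each pair of parallel lines in train.org / train.para is one gold paraphrase pair.
with open(f"data-train/{dataset_name}/supervised/train.org") as f_org, \
     open(f"data-train/{dataset_name}/supervised/train.para") as f_para:
    train_samples = [
        InputExample(texts=[org.strip(), para.strip()])
        for org, para in zip(f_org, f_para)
    ]

model = SentenceTransformer("bert-base-uncased")
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

epochs = 10
# 0.1 * total number of training steps of linear warmup, as described above.
warmup_steps = math.ceil(len(train_dataloader) * epochs * 0.1)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    # all remaining hyper-parameters left at the SentenceTransformer.fit defaults
)
```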
If you have further questions about this, I can give you more hints :)
Did you try SimCSE's supervised training objective in-domain on USEB? It would be interesting to compare it to `SBERT-supervised`!