UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Question: Does MultipleNegativesRankingLoss accept both pairs and triplets in the dataset? #2681

Open bely66 opened 5 months ago

bely66 commented 5 months ago

For example: I have multiple datasets:

  1. Dataset with pairs (translations, summaries, QA)
  2. Dataset with triplets (duplicate and non-duplicate questions, NLI (entailment and contradiction sentences), relevant and irrelevant questions)

Now I want to concatenate these datasets to train my model. Is there a way to do that where I have pairs and triplets in the same dataset, and how would I do that?

tomaarsen commented 5 months ago

Hello!

There are kind of two answers to this: before v3 and after v3 (i.e., presumably after this week). In both cases, the answer is "Yes, that's possible". The example is even in the same file, just in a different branch of the codebase.

pre v3: In training_multi-task.py on the master branch here: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_multi-task.py#L108 you can see how to combine multiple dataloaders, e.g. one for the pairs and one for the triplets, or one for each individual dataset (e.g. one for translations, one for summaries, etc.). Note that pre v3, these training objectives are used in "round robin style": one batch is fetched from each dataloader over and over until one of the dataloaders is empty, and then training stops entirely. So, it trains with the same number of samples from each dataloader. A minimal sketch of this setup follows below.
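For the pre-v3 route, a minimal sketch could look like the following. The model name, the toy examples, the batch size, and the epoch/warmup values are just placeholders; the full version is in the training_multi-task.py example linked above.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("distilbert-base-uncased")  # placeholder base model

# Toy pair dataset: (anchor, positive), e.g. a question and its matching answer
pair_examples = [
    InputExample(texts=["How do I reset my password?", "Click 'Forgot password' on the login page."]),
    InputExample(texts=["What is the capital of France?", "Paris is the capital of France."]),
]
# Toy triplet dataset: (anchor, positive, negative), e.g. NLI
triplet_examples = [
    InputExample(texts=["A man is eating.", "A person eats food.", "The man is sleeping."]),
    InputExample(texts=["A dog runs outside.", "An animal is outdoors.", "A cat sits indoors."]),
]

pair_dataloader = DataLoader(pair_examples, shuffle=True, batch_size=2)
triplet_dataloader = DataLoader(triplet_examples, shuffle=True, batch_size=2)

# The same loss can handle both pairs and triplets
train_loss = losses.MultipleNegativesRankingLoss(model)

# Each (dataloader, loss) tuple is one training objective; batches are drawn
# round-robin from the dataloaders until the smallest one is exhausted
model.fit(
    train_objectives=[(pair_dataloader, train_loss), (triplet_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```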

post v3: In training_multi-task.py in the v3.0-pre-release branch here: https://github.com/UKPLab/sentence-transformers/blob/v3.0-pre-release/examples/training/other/training_multi-task.py#L92 you can see how to combine multiple datasets via train_dataset, eval_dataset, and loss (although if the loss is the same across all train/eval datasets, then you can just pass the loss directly rather than a dictionary mapping dataset names to the same loss). By default, this uses the "Proportional" multi-dataset batch sampling strategy, i.e., each dataset is sampled proportionally to its size, so all data is sampled from, but not all datasets are learned from equally. You can change this by setting multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN to get the previous behaviour. A sketch of this setup follows below.
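For the post-v3 route, a sketch with the Trainer-based API might look like this. The dataset names, column names, toy rows, model name, and output_dir are placeholders; the pieces described above are train_dataset as a dictionary, a single shared loss, and the multi_dataset_batch_sampler argument.

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import MultiDatasetBatchSamplers

model = SentenceTransformer("distilbert-base-uncased")  # placeholder base model

# Toy pair dataset: columns ordered as (anchor, positive)
pairs = Dataset.from_dict({
    "anchor": ["How do I reset my password?"],
    "positive": ["Click 'Forgot password' on the login page."],
})
# Toy triplet dataset: columns ordered as (anchor, positive, negative)
triplets = Dataset.from_dict({
    "anchor": ["A man is eating."],
    "positive": ["A person eats food."],
    "negative": ["The man is sleeping."],
})

# One loss shared by every dataset; it accepts both pairs and triplets
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="multi-task-model",
    # Optional: switch from the default PROPORTIONAL sampling to round robin
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset={"pairs": pairs, "triplets": triplets},
    loss=loss,
)
trainer.train()
```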

And yes, both pairs and triplets work for MultipleNegativesRankingLoss, and the multi-task/multi-dataset setup should work quite well with it.
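To see why both formats work, here is a rough, simplified illustration of the scoring inside MultipleNegativesRankingLoss. The actual loss operates on the model's embeddings with a scaled similarity function; the random tensors and plain dot-product similarity below are only for illustration. With pairs, the other in-batch positives act as negatives; with triplets, the explicit negatives are appended as extra candidates.

```python
import torch
import torch.nn.functional as F

# Toy embeddings standing in for the model's encoded texts (batch of 4, dim 8)
anchors = torch.randn(4, 8)
positives = torch.randn(4, 8)
negatives = torch.randn(4, 8)  # only present in the triplet case

# Pair case: score each anchor against every positive in the batch;
# the positives belonging to other anchors act as in-batch negatives.
pair_scores = anchors @ positives.T                             # shape (4, 4)

# Triplet case: the explicit negatives become extra candidate columns.
triplet_scores = anchors @ torch.cat([positives, negatives]).T  # shape (4, 8)

# In both cases, the "correct" column for anchor i is its own positive (column i),
# and the loss is cross-entropy over the similarity scores.
labels = torch.arange(len(anchors))
pair_loss = F.cross_entropy(pair_scores, labels)
triplet_loss = F.cross_entropy(triplet_scores, labels)
```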

Whether you choose pre v3 or post v3 is up to you, but v3 should launch rather soon. The documentation for v3 is kind of hard to read for now (it's in the docs directory in the v3.0-pre-release branch on GitHub, but it's a tricky mix of Markdown, HTML, and reStructuredText; you can compile it yourself, but that's also not very convenient). That said, this (still unreleased) blog post might help if you want to get a head start with some text that's a bit easier to read: https://github.com/huggingface/blog/pull/2104

bely66 commented 5 months ago

@tomaarsen

Thanks a lot for the quick response, I really appreciate it.

Would love to contribute; are there any references on how to do that?