bely66 opened 5 months ago
Hello!
There are two answers to this: before v3 and after v3 (i.e., presumably after this week). In both cases, the answer is "Yes, that's possible". The example is even in the same file, just on a different branch of the codebase.
pre v3
In training_multi-task.py in the master branch here: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_multi-task.py#L108
There you can see how to combine multiple dataloaders, e.g. one for the pairs and one for the triplets, or one per dataset (e.g. one for translations, one for summaries, etc.). Note that pre v3, these training objectives are used in "round robin" style: it fetches one batch from each dataloader in turn, over and over, until one of the dataloaders is empty. Then it stops entirely. So it trains on the same number of batches (and, with equal batch sizes, the same number of samples) from each dataloader.
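To make the round-robin behaviour concrete, here is a minimal pure-Python sketch. It is not the library's training loop, just a simulation of the schedule described above; the `round_robin_batches` helper and the toy "dataloaders" (plain lists of batches) are hypothetical names made up for illustration.

```python
def round_robin_batches(dataloaders):
    """Yield (name, batch) tuples, one batch per dataloader per round.

    Stops as soon as any dataloader is exhausted, so every dataloader
    contributes the same number of batches.
    """
    iterators = {name: iter(loader) for name, loader in dataloaders.items()}
    while True:
        current_round = []
        for name, iterator in iterators.items():
            try:
                current_round.append((name, next(iterator)))
            except StopIteration:
                # One dataloader is empty: drop the partial round and stop.
                return
        yield from current_round


# Toy stand-ins for dataloaders: 3 batches of pairs, 5 batches of triplets.
toy_loaders = {
    "pairs": [["p0"], ["p1"], ["p2"]],
    "triplets": [["t0"], ["t1"], ["t2"], ["t3"], ["t4"]],
}
schedule = list(round_robin_batches(toy_loaders))
names = [name for name, _ in schedule]
print(names)
```

In this toy run the last two triplet batches are never used, which is the trade-off of round robin: balanced objectives, but the larger dataset is only partially seen in each pass.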
post v3
In training_multi-task.py in the v3.0-pre-release branch here: https://github.com/UKPLab/sentence-transformers/blob/v3.0-pre-release/examples/training/other/training_multi-task.py#L92
There you can see how to combine multiple datasets via train_dataset, eval_dataset, and loss (although if the loss is the same for all train/eval datasets, you can just pass the loss directly rather than a dictionary mapping each dataset name to the same loss).
By default, this uses the "Proportional" multi-dataset batch strategy, i.e., each dataset is sampled in proportion to its size, so all data is sampled from, but not all datasets are learned from equally. You can change this by setting multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN to get the previous behaviour.
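The difference between the two strategies can be sketched with a small simulation. This is only an illustration of the scheduling idea, not the library's sampler code; the helper functions, the dataset names, the sizes, and the simplification of dropping incomplete batches are all assumptions.

```python
import random
from collections import Counter


def proportional_schedule(sizes, batch_size, seed=0):
    """Every batch of every dataset is used once, in shuffled order,
    so larger datasets show up proportionally more often."""
    names = [name for name, size in sizes.items() for _ in range(size // batch_size)]
    random.Random(seed).shuffle(names)
    return names


def round_robin_schedule(sizes, batch_size):
    """One batch per dataset per round, until the smallest dataset runs out."""
    rounds = min(size // batch_size for size in sizes.values())
    return [name for _ in range(rounds) for name in sizes]


# Hypothetical dataset sizes (number of samples).
sizes = {"pairs": 64, "triplets": 16}

proportional = Counter(proportional_schedule(sizes, batch_size=8))
round_robin = Counter(round_robin_schedule(sizes, batch_size=8))
print("proportional:", dict(proportional))
print("round robin: ", dict(round_robin))
```

With these toy sizes, the proportional schedule uses all the data but is dominated by the larger dataset, while round robin treats both datasets equally and leaves most of the larger one unused.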
And yes, both pairs and triplets work for MultipleNegativesRankingLoss, and the multi-task/multi-dataset setup should work quite well with it.
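As a rough intuition for why both input formats work, here is a toy pure-Python sketch of an in-batch-negatives ranking loss. It is not the library's implementation: the `toy_ranking_loss` function and the 2-d "embeddings" are made up, and a plain dot product is assumed as the similarity score. With pairs, the other positives in the batch act as negatives; with triplets, the explicit negatives are simply appended as extra candidates.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def toy_ranking_loss(anchors, positives, negatives=None):
    """Mean cross-entropy where anchor i should rank positive i above
    every other candidate in the batch."""
    # Candidates: all positives, plus hard negatives if triplets are given.
    candidates = positives + (negatives or [])
    losses = []
    for i, anchor in enumerate(anchors):
        scores = [dot(anchor, candidate) for candidate in candidates]
        log_normaliser = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_normaliser - scores[i])
    return sum(losses) / len(losses)


# Made-up embeddings for a batch of two examples.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[0.5, 0.5], [0.5, 0.5]]

pair_loss = toy_ranking_loss(anchors, positives)
triplet_loss = toy_ranking_loss(anchors, positives, negatives)
print(pair_loss, triplet_loss)
```

The same function handles both cases; the triplet batch just has more candidates to rank against, so its loss is higher for the same anchors and positives.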
Whether you choose pre v3 or post v3 is up to you, but v3 should launch rather soon. The documentation of v3 is kind of hard to read for now (it's in the docs directory in the v3.0-pre-release branch on GitHub, but it's a tricky mix of markdown, HTML and reStructuredText. You can compile it yourself, but that's also not very convenient). That said, this (still unreleased) blogpost might help if you want to get a head start with some text that's a bit easier to read: https://github.com/huggingface/blog/pull/2104
@tomaarsen
Thanks a lot for the quick response, I really appreciate it!
Would love to contribute, any references on how to do that?
For example: I have multiple datasets:
Now I want to concatenate these datasets to train my model. Is there a way to do that where I have pairs and triplets in the same dataset, and if so, how?