UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0
14.73k stars · 2.43k forks

Multitask Training #159

Closed · djstrong closed this 4 years ago

djstrong commented 4 years ago

Multitask training example would be helpful.

nreimers commented 4 years ago

Hi @djstrong I added an example for multi-task learning here: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_multi-task.py
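Conceptually, multi-task training in sentence-transformers draws one batch per objective at every training step, cycling each task's DataLoader independently. A pure-Python sketch of that round-robin scheme (the list stand-ins and names here are illustrative, not the library's actual classes):

```python
from itertools import cycle

# Hypothetical stand-ins for two tasks' DataLoaders (lists of batches).
nli_batches = ["nli_0", "nli_1", "nli_2"]
sts_batches = ["sts_0", "sts_1"]

# One batch per objective at every step; each loader cycles on its own,
# mirroring how fit(train_objectives=...) interleaves tasks.
iterators = [cycle(nli_batches), cycle(sts_batches)]
trace = []
for _ in range(4):          # 4 training steps
    for it in iterators:
        trace.append(next(it))  # one loss/backward pass per batch

print(trace)
# → ['nli_0', 'sts_0', 'nli_1', 'sts_1', 'nli_2', 'sts_0', 'nli_0', 'sts_1']
```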

Best Nils Reimers

djstrong commented 4 years ago

Thank you!

RobertHua96 commented 4 years ago

Hi @nreimers


I tried following this example, code verbatim, but for some reason it seems like the model is only being trained on STS and the AllNLI dataset is ignored.

Feels that way because when training on datasets individually, ALLNLI takes a few hours for 1 epoch. Here my epochs finish in a matter of 20 minutes.

Would you have any ideas on how I could fix this?

nreimers commented 4 years ago

What does the rest of your code look like?

Have you used this example? https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_multi-task.py

RobertHua96 commented 4 years ago

Yes. Sorry, the above link was broken, so I found that file through search and used it verbatim.

From my Google colab:


The datasets are logged as read. In another Colab I managed to successfully fine-tune a RoBERTa model on AllNLI for single-task training. For some reason, taking that model and then trying to fine-tune it on STS for sequential learning leads to no change in loss, which I'm trying to investigate too.

Code in text:

```python
# Convert the dataset to a DataLoader ready for training
logging.info("Read AllNLI train dataset")
train_data_nli = SentencesDataset(nli_reader.get_examples('train.gz'), model=model)
train_dataloader_nli = DataLoader(train_data_nli, shuffle=True, batch_size=batch_size)
train_loss_nli = losses.SoftmaxLoss(model=model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), num_labels=train_num_labels)

logging.info("Read STSbenchmark train dataset")
train_data_sts = SentencesDataset(sts_reader.get_examples('sts-train.csv'), model=model)
train_dataloader_sts = DataLoader(train_data_sts, shuffle=True, batch_size=batch_size)
train_loss_sts = losses.CosineSimilarityLoss(model=model)

logging.info("Read STSbenchmark dev dataset")
dev_data = SentencesDataset(examples=sts_reader.get_examples('sts-dev.csv'), model=model)
dev_dataloader = DataLoader(dev_data, shuffle=False, batch_size=batch_size)
evaluator = EmbeddingSimilarityEvaluator(dev_dataloader)

train_objectives = [(train_dataloader_nli, train_loss_nli), (train_dataloader_sts, train_loss_sts)]

warmup_steps = math.ceil(len(train_dataloader_sts) * num_epochs / batch_size * 0.1)  # 10% of train data for warm-up
logging.info("Warmup-steps: {}".format(warmup_steps))
```

wangrunchuan commented 3 years ago

> I tried following this example - code verbatim - but for some reason it seems like the model is only being trained on STS, and the ALLNLI dataset is ignored.
>
> Feels that way because when training on datasets individually, ALLNLI takes a few hours for 1 epoch. Here my epochs finish in a matter of 20 minutes.
>
> Would you have any ideas on how I could fix this?

The `steps_per_epoch` value in `SentenceTransformer.fit()` defaults to the length of the shortest DataLoader. Since the STS dataset is smaller than the AllNLI dataset, not all of the AllNLI data is seen each epoch.
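A small sketch of that default and its consequence, using plain lists as stand-ins for DataLoaders (the sizes here are made up for illustration):

```python
# Stand-in DataLoaders of unequal length (lists of batch indices).
dataloader_nli = list(range(1000))   # large task
dataloader_sts = list(range(200))    # small task

# When steps_per_epoch is not given, fit() uses the shortest DataLoader,
# so each epoch draws only that many batches from EVERY objective.
steps_per_epoch = min(len(dataloader_nli), len(dataloader_sts))
print(steps_per_epoch)                        # → 200

# Fraction of the large dataset actually seen per epoch:
print(steps_per_epoch / len(dataloader_nli))  # → 0.2
```

This is why an epoch can finish in minutes even though AllNLI alone would take hours: only a fifth of its batches are drawn per epoch in this example. Passing a larger `steps_per_epoch` explicitly (or training for more epochs) covers more of the large dataset.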

ahmedbesbes commented 3 years ago

Hello @nreimers, great implementation of multi-task learning. I have a question though: how would you assign different weights to each loss? Thanks

nreimers commented 3 years ago

Hi @ahmedbesbes, I would create a new loss (just copy the code from the loss class you use) and add an option there to scale the loss by a configurable weight before it is returned from the forward() method.
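A minimal sketch of that idea: instead of copying a loss class, wrap any loss callable and scale its output. The `WeightedLoss` name and the plain-callable stand-in are hypothetical, not part of the sentence-transformers API (there, you would multiply inside the class's `forward()` and the loss would be a `torch.nn.Module`):

```python
class WeightedLoss:
    """Wraps a loss callable and scales its output by a fixed weight."""
    def __init__(self, loss_fct, weight):
        self.loss_fct = loss_fct
        self.weight = weight

    def __call__(self, *args, **kwargs):
        # Scaling the loss scales its gradients, and hence its
        # contribution to each model update, by the same factor.
        return self.weight * self.loss_fct(*args, **kwargs)

# Stand-in loss: mean squared error over two float lists.
mse = lambda pred, gold: sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(pred)

half = WeightedLoss(mse, 0.5)
double = WeightedLoss(mse, 2.0)

print(mse([1.0, 2.0], [0.0, 0.0]))     # → 2.5
print(half([1.0, 2.0], [0.0, 0.0]))    # → 1.25
print(double([1.0, 2.0], [0.0, 0.0]))  # → 5.0
```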

RobertHua96 commented 2 years ago

Hi @nreimers sorry for the basic question but just wanted to confirm - in a multi task setting if I took a loss and multiplied it by say 0.5, that would halve that loss's effect for model updates? And conversely if I do something like loss * 2 it would double the loss's effect for model updates?

nreimers commented 2 years ago

Yes

Xan1912 commented 2 years ago

Hi, @nreimers. Thanks for the multi-tasking setup. One question: Is this multi-tasking scenario already using any kind of shared encoder between the tasks?