UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

In-batch negatives with Multiple Negative Ranking Loss #1587

Open joaocp98662 opened 2 years ago

joaocp98662 commented 2 years ago

Hi,

I'm trying to fine-tune a model on my dataset but I'm having some trouble. My dataset contains pairs of documents and topics (queries) together with their scores (0 = not relevant, 1 = relevant). Only 10% of the pairs are positive. I'm trying to fine-tune a pre-trained model with the MNR loss, but I'm stuck on how to pass the data to the DataLoader.

Data example:

TopicID  DocID        Score
20141    NCT00000408  0
20141    NCT00000492  0
20141    NCT00000501  0
20141    NCT00001853  0
20141    NCT00004727  0
20141    NCT00005127  1
20141    NCT00005485  1  🔴
…
201518   NCT00005485  1  🔴
201518   NCT00005499  0
201518   NCT00012818  1
201518   NCT00013026  1
201518   NCT00053534  0
...

(Note: I'm preprocessing the data and using the text of the documents/topics from the corpus, not the IDs.)

Each topic can have multiple relevant documents, and a document can be relevant to several topics (the 🔴 rows above). Do I have to write a specific data loader?

Thank you very much for your time.
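A minimal pure-Python sketch of one way to prepare this data (the rows and the ID-to-text dicts below are invented stand-ins for the preprocessed corpus mentioned above): keep only the score-1 rows as (anchor, positive) pairs. In sentence-transformers, each pair would then be wrapped as `InputExample(texts=[query, positive_doc])` and fed through a regular DataLoader to `losses.MultipleNegativesRankingLoss`, which uses the other positives in the batch as negatives.

```python
# Toy rows in the issue's (TopicID, DocID, Score) format; the text
# lookup dicts are hypothetical stand-ins for the resolved corpus.
rows = [("20141", "NCT00005127", 1),
        ("20141", "NCT00000408", 0),
        ("20141", "NCT00005485", 1)]
topic_text = {"20141": "query text for topic 20141"}
doc_text = {"NCT00005127": "document text A",
            "NCT00000408": "document text B",
            "NCT00005485": "document text C"}

def positive_pairs(rows, topic_text, doc_text):
    """Keep only score == 1 rows: MNR loss needs (anchor, positive)
    pairs; the other pairs in the same batch act as in-batch negatives."""
    return [(topic_text[t], doc_text[d]) for t, d, s in rows if s == 1]

pairs = positive_pairs(rows, topic_text, doc_text)
# Each pair would become InputExample(texts=[query, positive_doc]).
```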

mscham commented 2 years ago

Check out the example here: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_mnrl.py. MNRL will take triplets, as in that example, or you can just use positive pairs.
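For the triplet variant, a pure-Python sketch under the same assumptions as above (toy rows, IDs in place of text): for each positive (topic, doc) pair, attach one judged-negative document from the same topic as a hard negative, giving (anchor, positive, negative) triples for `InputExample(texts=[...])`.

```python
import random

rows = [("20141", "NCT00005127", 1),
        ("20141", "NCT00000408", 0),
        ("20141", "NCT00005485", 1),
        ("20141", "NCT00000492", 0)]

def make_triplets(rows, seed=0):
    """For each positive (topic, doc) pair, sample one score-0 doc from
    the same topic as a hard negative: (anchor, positive, negative)."""
    rng = random.Random(seed)
    by_topic = {}
    for t, d, s in rows:
        by_topic.setdefault(t, {0: [], 1: []})[s].append(d)
    triples = []
    for t, docs in by_topic.items():
        for pos in docs[1]:
            if docs[0]:  # topics without judged negatives yield plain pairs
                triples.append((t, pos, rng.choice(docs[0])))
    return triples

trips = make_triplets(rows)
```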

mscham commented 2 years ago

Also take a look at the GPL technique for augmenting your data...

https://www.pinecone.io/learn/gpl/ https://github.com/UKPLab/gpl

joaocp98662 commented 2 years ago

@mscham thank you so much for your reply. I ended up implementing MNRL with triplets as you suggested: for each positive pair I added one negative for that anchor. I trained the model and used it for my task, which is retrieval. After evaluating my rankings, the results were far worse than with the model trained on positive pairs only. I was expecting better results with the triplets. Am I doing something wrong? I'm training with batches of size 8; if I increase the batch size, it no longer fits on the GPUs I'm using.
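One thing worth noting about batch size 8: MNR loss scores each anchor against every other example in the batch, so each anchor sees batch_size - 1 in-batch negatives (plus any explicit hard negative). A toy sketch of the per-anchor cross-entropy, pure Python rather than the library code, showing that a bigger batch gives a harder, more informative contrastive task:

```python
import math

def mnr_loss_row(scores, positive_idx):
    """Softmax cross-entropy over one anchor's similarity scores:
    the column at positive_idx is the true positive, every other
    in-batch column acts as a negative."""
    log_z = math.log(sum(math.exp(s) for s in scores))
    return log_z - scores[positive_idx]

# Same positive/negative similarities (toy numbers), different batch sizes:
# with batch size 2 the positive competes against 1 negative,
# with batch size 8 it competes against 7.
small_batch_loss = mnr_loss_row([5.0, 1.0], 0)
large_batch_loss = mnr_loss_row([5.0] + [1.0] * 7, 0)
```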

joaocp98662 commented 2 years ago

I noticed that I wasn't preventing duplicates within my batches, so I'm now using the NoDuplicatesDataLoader class and the results improved slightly. But the model fine-tuned with only positive pairs still gives better results on my ranking task than the model fine-tuned with triplets (positive pair + a negative from the same query). Should I expect worse results when adding triplets during training? I was expecting better results.
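The duplicate problem matters because a positive document that appears twice in one batch is treated as an in-batch negative for the other anchor, actively pushing a true match away. A pure-Python sketch of the same idea NoDuplicatesDataLoader implements (greedy batching that defers any example sharing a text with the current batch; the example data is invented):

```python
from collections import deque

def no_duplicate_batches(examples, batch_size):
    """Greedily fill each batch, deferring any example that shares a
    text with the current batch, so MNR loss never sees a true positive
    posing as an in-batch negative."""
    pending = deque(examples)
    batches = []
    while pending:
        batch, seen, skipped = [], set(), []
        while pending and len(batch) < batch_size:
            ex = pending.popleft()
            if seen.isdisjoint(ex):
                batch.append(ex)
                seen.update(ex)
            else:
                skipped.append(ex)
        pending.extendleft(reversed(skipped))  # retry in a later batch
        batches.append(batch)
    return batches

# "D2" is a positive for both Q2 and Q3, so the Q3 triplet is deferred:
data = [("Q1", "D1", "N1"), ("Q2", "D2", "N2"),
        ("Q3", "D2", "N3"), ("Q4", "D4", "N4")]
batches = no_duplicate_batches(data, batch_size=3)
# → [[Q1-triple, Q2-triple, Q4-triple], [Q3-triple]]
```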