Open joaocp98662 opened 2 years ago
Check out the example here: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_mnrl.py MNRL will take triplets, as in that example, or you can use just positive pairs.
Also take a look at the GPL technique for augmenting your data:
https://www.pinecone.io/learn/gpl/ https://github.com/UKPLab/gpl
@mscham thank you so much for your reply. I ended up implementing MNRL with triplets as you suggested: for each positive pair I added an additional negative for that anchor. I trained the model and used it for my task, which is retrieval. After evaluating my ranking, the results were far worse than with the model trained on only positive pairs. I was expecting better results from the triplets. Am I doing something wrong? When training the model I'm using batches of size 8; if I increase the batch size it no longer fits on the GPUs I am using.
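The triplet construction described above (one extra negative per anchor, drawn from the documents judged for the same query) can be sketched in pure Python. This is only an illustrative sketch; the helper name, the toy row/text data, and the dict-based corpus lookup are all assumptions, not the author's actual code. Each resulting tuple would typically become an `InputExample(texts=[query, positive, negative])` for sentence-transformers.

```python
import random

def build_triplets(rows, topics, docs, seed=0):
    """For each (topic, relevant doc) pair, attach one non-relevant
    doc judged for the SAME topic as a hard-ish negative.
    rows: iterable of (topic_id, doc_id, score) with score 0 or 1."""
    rng = random.Random(seed)
    by_topic = {}
    for topic_id, doc_id, score in rows:
        group = by_topic.setdefault(topic_id, {"pos": [], "neg": []})
        group["pos" if score == 1 else "neg"].append(doc_id)
    triplets = []
    for topic_id, group in by_topic.items():
        for pos in group["pos"]:
            if group["neg"]:  # topics with no judged negatives fall back to pairs
                neg = rng.choice(group["neg"])
                triplets.append((topics[topic_id], docs[pos], docs[neg]))
    return triplets

# Toy data standing in for the TopicID/DocID/Score table in the question.
rows = [
    ("20141", "NCT00005127", 1),
    ("20141", "NCT00000408", 0),
    ("201518", "NCT00013026", 1),  # no negatives judged for this topic
]
topics = {"20141": "query text A", "201518": "query text B"}
docs = {"NCT00005127": "relevant doc", "NCT00000408": "irrelevant doc",
        "NCT00013026": "another relevant doc"}
triplets = build_triplets(rows, topics, docs)
```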
I noticed that I wasn't preventing duplicates in my batches, so now I'm using the NoDuplicatesDataLoader class and the results improved slightly. But the model fine-tuned with only positive pairs is still giving me better results on my ranking task than the model fine-tuned with triplets (positive pair + negative from that query). Should I expect worse results when adding the triplets during training? I was expecting better ones.
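For readers wondering why duplicates matter here: with MNR loss, every other item in the batch is treated as a negative, so a positive text that appears twice in one batch is scored as a negative for the other anchor. Below is a minimal pure-Python sketch of the idea behind NoDuplicatesDataLoader (the function name and greedy strategy are my own simplification, not the library's implementation):

```python
def no_duplicate_batches(examples, batch_size):
    """Greedy batching that skips an example if any of its texts is
    already present in the current batch, deferring it to a later
    batch -- the same idea as sentence-transformers'
    NoDuplicatesDataLoader."""
    remaining = list(examples)
    batches = []
    while remaining:
        batch, seen, leftover = [], set(), []
        for ex in remaining:
            if len(batch) < batch_size and not any(t in seen for t in ex):
                batch.append(ex)
                seen.update(ex)
            else:
                leftover.append(ex)
        if not batch:  # safety: nothing could be placed
            break
        batches.append(batch)
        remaining = leftover
    return batches

# ("q2", "d1") shares document text with ("q1", "d1"),
# so it is deferred to the next batch instead of becoming
# a false in-batch negative.
examples = [("q1", "d1"), ("q2", "d1"), ("q3", "d3")]
batches = no_duplicate_batches(examples, batch_size=2)
```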
Hi,
I'm trying to fine-tune a model on my dataset but I'm having some trouble. In my dataset I have pairs of documents and topics (queries) with their scores (0 = not relevant, 1 = relevant). Only 10% of the pairs are positive. I'm trying to fine-tune a pre-trained model using the MNR loss, but I'm stuck on how to pass the data to the DataLoader.
Data example:
TopicID  DocID        Score
20141    NCT00000408  0
20141    NCT00000492  0
20141    NCT00000501  0
20141    NCT00001853  0
20141    NCT00004727  0
20141    NCT00005127  1
20141    NCT00005485  1  🔴
…
201518   NCT00005485  1  🔴
201518   NCT00005499  0
201518   NCT00012818  1
201518   NCT00013026  1
201518   NCT00053534  0
...
(Note: I'm preprocessing the data and using the corpus text of the documents/topics, not the IDs.)
Each topic can have multiple relevant documents, and a document can be relevant to different topics. Do I have to write a specific data loader?
Thank you very much for your time.