UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Best approach to domain-specific semantic search? #1903

Open shensmobile opened 1 year ago

shensmobile commented 1 year ago

Hi everyone,

I have worked with Transformers for just over a year now on other applications like classification and NER, but want to experiment with semantic search now. I work in a pretty specific domain (medical) but have access to a lot of sample documents. I typically work with long documents (easily over 300 tokens), so I expect that I'll need to work with asymmetric search and the MS MARCO dataset.

I've looked into other issues here on GitHub and read through a few blog posts, and I think my approach will be to start from the fine-tuned BERT-based MLM that was trained on my medical corpus and that I already use for other downstream tasks.

I have already done a few rounds of training using the MarginMSE training script for the bi-encoder. So far, it does appear to work reasonably well for ranking a small test-set of documents, but I know that I will need a cross-encoder to re-rank and I am training that now.

Have people had good results with mixing a pre-trained MLM, the MS MARCO dataset, and some unsupervised learning? I have a large batch of documents that I could use to fine-tune, if unsupervised learning would not undo what the model learns from MS MARCO. Alternatively, I was thinking about using a text summarization model to create "queries" for each of my documents and building a dataset that way. If I were to do this, should I outright replace MS MARCO or use a combination of the two?

Or should I just outright skip all of this and go straight to one of the unsupervised techniques like DeCLUTR?

Follow-up:

After some additional training time, I have both MarginMSE and MultipleNegativesRankingLoss (MNRL) trained models using the MS MARCO dataset, with my fine-tuned MLM as the backbone. I have also trained a CrossEncoder on MS MARCO with my MLM.

Strangely enough, the MNRL model (using cos_sim) appears to be better at retrieving data than the MarginMSE model (using dot product). When I expanded my test set of documents significantly, I could see that the MarginMSE model is much worse at separating relevant and irrelevant documents (even though the relevant ones do rise to the top). I have to rely heavily on the cross encoder to re-rank the MarginMSE model results. The MNRL model retrieves relevant data with much greater confidence and there's a significant difference between the relevant data and the irrelevant data.
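The retrieve-then-rerank flow described here (cheap bi-encoder over the whole corpus, expensive cross-encoder over the short list only) can be sketched in plain Python. The two scoring functions below are hypothetical stand-ins for a SentenceTransformer similarity and a CrossEncoder prediction, just to show the shape of the pipeline:

```python
# Minimal sketch of retrieve-then-rerank. `bi_score` and `cross_score`
# are toy stand-ins for bi-encoder and cross-encoder scoring.

def retrieve_then_rerank(query, corpus, bi_score, cross_score, top_k=10):
    # Stage 1: score every document with the cheap bi-encoder model.
    candidates = sorted(corpus, key=lambda doc: bi_score(query, doc),
                        reverse=True)[:top_k]
    # Stage 2: re-rank only the top_k candidates with the cross-encoder.
    return sorted(candidates, key=lambda doc: cross_score(query, doc),
                  reverse=True)

# Toy scorers: token overlap for retrieval, length match for re-ranking.
def bi_score(query, doc):
    return len(set(query.split()) & set(doc.split()))

def cross_score(query, doc):
    return -abs(len(query) - len(doc))
```

The point of the two-stage design is that the bi-encoder only needs to get relevant documents *into* the top-k; the cross-encoder fixes the ordering within it.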

To make sure I'm not doing anything wrong: with the output of both the MNRL and MarginMSE training scripts, I just load the models with SentenceTransformer(), encode the query and corpus, and use cos_sim() for MNRL and dot_score() for MarginMSE, right? No normalization required for either?
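For reference, cosine similarity is just the dot product of length-normalized vectors, which is why no extra normalization step is needed when you call the cosine scorer directly. A toy demonstration with plain Python lists (these helpers mirror, but are not, the sentence-transformers utilities):

```python
import math

def dot_score(u, v):
    # Unnormalized dot product, as used with MarginMSE/dot-product models.
    return sum(a * b for a, b in zip(u, v))

def cos_sim(u, v):
    # Cosine similarity, as used with MNRL/cosine models.
    return dot_score(u, v) / (math.sqrt(dot_score(u, u)) * math.sqrt(dot_score(v, v)))

def normalize(v):
    # Scale a vector to unit length.
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]
```

So normalizing embeddings and taking the dot product gives exactly the cosine score; the two scorers only disagree when vector magnitudes carry information, which is what a dot-product-trained model exploits.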

To improve results, I think I'm going to utilize the GPL approach from here to train the MNRL model since it appears to be much better: https://github.com/UKPLab/sentence-transformers/blob/master/examples/domain_adaptation/README.md

Am I right to think that the combination of MS MARCO and the GPL of my corpus data will yield the best possible semantic search model? And is it possible to get the Bi-Encoder to a state that will not need the cross encoder?

HenryL27 commented 1 year ago

Something I found about the GPL library: it will generate its data and then write it to disk so that in the future it doesn't have to go through the arduous process of re-augmenting. It also only generates as many pseudo-labels as it needs for the number of training steps you give it. This means that if you test it out first with a small number of training steps, then when you try it with a large number of training steps it will load only the small set of labels it generated in the first test, leading to overfitting and bad times.

shensmobile commented 1 year ago

Thanks for that heads up. I decided to just try GPL last night and used the default settings with a sample corpus of over 300k reports. It should be done in about 6 hours. Fingers crossed that it improves the performance of my model.

I didn't have much time to look into the underlying train() code, but by default it looked like GPL wants to train a dot product model so I resorted to using my MarginMSE trained model. I wonder if I can get it to train using cos_sim instead and feed in my MNRL model. I feel significantly more confident using that model.

HenryL27 commented 1 year ago

The tricky thing is that the generated labels come from a cross-encoder, so there is no guarantee on how big they can be. What I've found works decently well is to rescale them to [-1,1] (there's a rescale_range param for gpl.train(); I think it's just an arithmetic normalization). That way the cosine can actually hit the target values. Best of luck!
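To illustrate what an arithmetic rescale into [-1, 1] does (this is a generic min-max rescale, not necessarily the exact implementation behind GPL's rescale_range parameter):

```python
def rescale(scores, lo=-1.0, hi=1.0):
    # Min-max rescale a batch of cross-encoder scores into [lo, hi],
    # so a cosine-similarity model can actually reach the target values.
    mn, mx = min(scores), max(scores)
    if mx == mn:
        # Degenerate batch: all scores equal; map to the midpoint.
        return [(lo + hi) / 2.0] * len(scores)
    return [lo + (s - mn) * (hi - lo) / (mx - mn) for s in scores]
```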

shensmobile commented 1 year ago

If my results are good, I'll be sure to leave a final note here on my adventures with domain adaptation so others can save themselves the headache (although research is always healthy!)

mpizosdim commented 1 year ago

@shensmobile Could you share your final note? :)

xehu commented 1 year ago

@shensmobile +1 --- I am working on this exact problem (wanting to fine-tune SBERT to have domain adaptation), and would love additional resources.

shensmobile commented 1 year ago

Hey everyone,

I think that my edit to my above comment captured most of my final journey, but I know I wrote a lot more on reddit. Unfortunately the subs that I post on have gone private so I can't see my own comments!

If I recall correctly, my overall path to success was:

1) Train an MLM on my corpus of data.
2) Use the MS MARCO dataset to train a sentence embedding model with the MarginMSE training script, and also train a cross-encoder using the training script from the MS MARCO examples. Even though the MS MARCO dataset is not domain specific, it's a stepping stone to building some comparative knowledge.
3) Take my corpus of data from step 1 and throw it into GPL. I mimicked the corpus.jsonl structure from the original GPL repo.
4) GPL will spit out a massive new batch of labels that can be used to retrain the MarginMSE model.
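For step 3, the corpus.jsonl file is one JSON object per line in the BEIR-style layout with "_id", "title", and "text" fields. A minimal sketch with hypothetical documents:

```python
import json

# Hypothetical documents; each line of corpus.jsonl is one JSON object
# with "_id", "title", and "text" fields (BEIR-style layout).
docs = [
    {"_id": "doc0", "title": "", "text": "First medical report ..."},
    {"_id": "doc1", "title": "", "text": "Second medical report ..."},
]

with open("corpus.jsonl", "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```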

I took the end result into FAISS and created a mini-demo of a semantic search engine. This is certainly not the end; there are a LOT of improvements that can be made, and I really hope to contribute more to this space once I have some free time on my hands again. At this point, I would want to go through the GPL labels and devise a method to reject obviously incorrect generated queries. I think that would yield even better performance.

lppier commented 1 year ago

How much data did you use to achieve a decent result? Curious...