Semantic search on finetuned LM

UKPLab / sentence-transformers

State-of-the-Art Text Embeddings

https://www.sbert.net

Apache License 2.0


Semantic search on finetuned LM #824

Open nithya-AK opened 3 years ago

nithya-AK commented 3 years ago

Hey! First of all, thank you for the awesome work you are doing. I would be grateful if you could help me with the following situation: I have an unlabelled, domain-specific dataset and I want to do semantic search on it. I followed the Hugging Face tutorial and fine-tuned a RoBERTa model on my data with MLM. Now, how can I use this fine-tuned model, along with the RoBERTa-base tokenizer, in sentence-transformers to generate sentence embeddings and later do semantic search? I did try the examples provided, but I am not sure I followed them correctly.

Thanks in advance :)

nreimers commented 3 years ago

Hi @nithya-AK, my Ph.D. student is currently working on this, and we will integrate the code and publish a paper soon.

Sadly it is not straightforward and just running MLM is not sufficient.

Also, do you have a symmetric or an asymmetric semantic search use case? https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search

nithya-AK commented 3 years ago

Oh, okay. Looking forward to it :) I have a symmetric semantic search use case as of now. Thanks a lot.

nithya-AK commented 3 years ago

Hello again @nreimers, a quick query: doesn't this address the same use case? Would it be useful in my scenario?

nreimers commented 3 years ago

@nithya-AK Just doing MLM does not yield good sentence / text embeddings. In fact, they are worse than more basic approaches like averaged GloVe embeddings.
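For readers unfamiliar with the baseline mentioned here, the following is a minimal sketch of averaged word embeddings used for symmetric semantic search. It is not code from this thread or from sentence-transformers: the vectors in `TOY_VECTORS` are invented for illustration (a real baseline would load pretrained GloVe vectors), and the helper names `average_embedding`, `cosine`, and `search` are hypothetical.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
# A real baseline would load pretrained GloVe vectors instead.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "dog": [0.7, 0.3, 0.0],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
}


def average_embedding(sentence, vectors=TOY_VECTORS):
    """Average the vectors of all known words; unknown words are skipped."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        raise ValueError("no known words in sentence")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query, corpus, top_k=2):
    """Return the top_k (score, sentence) pairs, most similar first."""
    q = average_embedding(query)
    scored = [(cosine(q, average_embedding(doc)), doc) for doc in corpus]
    return sorted(scored, reverse=True)[:top_k]


results = search("kitten", ["cat dog", "stock market"])
for score, sentence in results:
    print(f"{score:.3f}  {sentence}")
```

The query and the corpus sentences are embedded the same way and compared directly, which is what makes this a symmetric search setup.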

What you need is either:
1) Training data (i.e. data with some type of labels), or
2) A different pre-training objective. There is DeCLUTR https://arxiv.org/abs/2006.03659 and Contrastive Tension https://openreview.net/forum?id=Ov_sMNau-PF
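To give a rough idea of what a contrastive objective looks like, here is a minimal sketch of a generic in-batch contrastive (InfoNCE-style) loss on plain Python lists. This is an assumption-laden illustration, not the exact objective of DeCLUTR or Contrastive Tension and not code from sentence-transformers; the function name `contrastive_loss` and the `temperature` value are hypothetical choices.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(anchors, positives, temperature=0.05):
    """Mean cross-entropy over the batch.

    positives[i] is treated as the match for anchors[i]; every other
    entry of positives acts as an in-batch negative for that anchor.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each anchor is closest to its own positive and large when it is closer to another sentence in the batch. That is the signal that pushes embeddings of related texts together, and plain MLM training does not provide it.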

Soon these methods will be integrated into SBERT. Further, my student has developed a better approach based on denoising decoders that beats previous approaches. This will also be integrated into SBERT soon.

nithya-AK commented 3 years ago

Okay. Thanks again!