UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Fine-tune underlying language model for SBERT #1017

Closed. vdabravolski closed this issue 2 years ago

vdabravolski commented 3 years ago

Hi,

I'd like to use the SBERT model architecture for document similarity and topic modelling tasks. However, my corpus is fairly domain-specific, and I suspect that SBERT will underperform since it was trained on generic Wiki/library corpora. So, I wonder if there are any recommendations for fine-tuning the underlying language model for SBERT.

I envision that the overall process will be the following:

  1. Take a pre-trained BERT model
  2. Fine-tune the language model on a domain-specific corpus
  3. Retrain the SBERT model architecture on specific tasks (e.g. the SNLI dataset/task)

Curious to hear thoughts on the approach and problem definition.

pritamdeka commented 3 years ago

This is very similar to what I did a few days back. I fine-tuned a few of the available bio models from the Huggingface repo. My aim was to fine-tune the models for the sentence-similarity task on biomedical texts, but the problem is that for biomedical texts there is not much data available for the STS task. Normally you can follow the fine-tuning approach provided in the SBERT documentation. You just need to run the Python script for the fine-tuning task and pass in the Huggingface repo name of whichever model you want to fine-tune.
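
As an illustration of that workflow, here is a minimal sketch: wrap a Huggingface biomedical checkpoint in a SentenceTransformer and fine-tune it with CosineSimilarityLoss on STS-style pairs. The model name, dataset, and hyperparameters below are placeholders for illustration, not the exact setup described above.

```python
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Any Huggingface checkpoint can be dropped in here (biomedical model name as an example)
model_name = "dmis-lab/biobert-base-cased-v1.1"
word_embedding_model = models.Transformer(model_name, max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# STS-style training data: sentence pairs with similarity scores normalized to [0, 1]
stsb = load_dataset("glue", "stsb", split="train")
train_examples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=row["label"] / 5.0)
    for row in stsb
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("biobert-sts")
```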

nreimers commented 3 years ago

Yes, you can first train with MLM on your data: https://www.sbert.net/examples/unsupervised_learning/MLM/README.html

And then use the datasets from here and train your custom model: https://www.sbert.net/examples/training/paraphrases/README.html

Note: NLI data is rather narrow, so the embeddings you can derive from it are not the best. It is better to train broadly on the datasets from the paraphrase page, which yields much better embeddings.
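
For a concrete starting point on the MLM step, a minimal sketch with the plain Huggingface Trainer is shown below (this is not the exact script from the linked README; the corpus file, output directory, and hyperparameters are placeholders):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from a plain pre-trained checkpoint; "domain_corpus.txt" holds one sentence per line
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)

# Randomly mask 15% of the tokens for the masked-language-modeling objective
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-domain-mlm",
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized,
    data_collator=data_collator,
)
trainer.train()
trainer.save_model("bert-domain-mlm")
tokenizer.save_pretrained("bert-domain-mlm")
```

The resulting directory can then be used as the base model when building the SentenceTransformer for the supervised paraphrase training.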

vdabravolski commented 3 years ago

Thanks @nreimers, precisely what I was looking for.

RobertHua96 commented 3 years ago

Sorry for the really naive question, but I'm wondering whether the MLM code spits out a 'Huggingface' model that can then be loaded with their native code? The reason is that I want to train an MLM model with 'prompting' and then initialise it with a 'Zero Shot Classification' head/class of theirs - is that possible?

nreimers commented 3 years ago

Yes, it returns a standard huggingface transformers model
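
In other words, the output directory of the MLM step loads like any other Transformers checkpoint. A small sketch (the checkpoint path is a placeholder); note that a freshly attached classification head still has to be fine-tuned on an NLI-style dataset before the zero-shot-classification pipeline gives meaningful results:

```python
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer

# The MLM step saves a standard Transformers checkpoint (directory name is illustrative)
checkpoint = "bert-domain-mlm"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Load the encoder as-is ...
encoder = AutoModel.from_pretrained(checkpoint)

# ... or attach a freshly initialized sequence-classification head.
# For zero-shot classification this head still needs supervised NLI fine-tuning.
classifier = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
```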

federicoBetti commented 3 years ago

Thanks @nreimers for the reply.

I have a similar problem to @vdabravolski, having a domain-specific unlabeled corpus. What do you think about starting from an SBERT model already fine-tuned on paraphrases (e.g. paraphrase-distilroberta-base-v1) and fine-tuning it again on domain-specific data using the TSDAE approach shown here (https://www.sbert.net/examples/unsupervised_learning/TSDAE/README.html)?

What would be the effect of the domain-specific corpus compared to the approach of @vdabravolski in the first comment?

Thanks!

nreimers commented 3 years ago

@federicoBetti This sadly does not work well in our experiments.

kddubey commented 3 years ago

Hi @nreimers

In my rare case, I have access to a stream of labeled, domain-specific sentence pairs, and would ideally like to pretrain and fine-tune on this data before performing inference on a new batch of sentences. Would you expect a continuous training and inference setup like the one below to yield good clusters?

BERT model -> MLM(domain-specific sentences) -> CosineSimilarityLoss(domain-specific pairs) -> cluster new sentences -> MLM(domain-specific sentences) -> CosineSimilarityLoss(domain-specific pairs) -> cluster new sentences -> ...

I'm concerned that this scheme won't work well based on your last comment. So is there an efficient way to make use of new, labeled data?

Thanks!

nreimers commented 3 years ago

Hi @kddubey

I would run: BERT -> MLM (domain data) -> loss on pairs

When new data arrives, I would only continue training on the loss with your pairs, skipping MLM.

What you could do is periodic pre-training: e.g. instead of doing MLM each time new data arrives, you first collect a large batch of data, then do MLM, then do supervised training on all available pairs.
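
A rough sketch of one round of that loop with the sentence-transformers API follows (checkpoint names, pairs, and hyperparameters are made up for illustration):

```python
from sklearn.cluster import KMeans
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Reload the checkpoint from the previous round (path is illustrative)
model = SentenceTransformer("domain-sbert-round1")

# Continue training only on the newly arrived labeled pairs; MLM is skipped from here on
new_pairs = [
    InputExample(texts=["new sentence A", "new sentence B"], label=0.8),
    InputExample(texts=["new sentence C", "new sentence D"], label=0.2),
]
train_dataloader = DataLoader(new_pairs, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("domain-sbert-round2")

# Embed and cluster the incoming batch of sentences
new_sentences = ["incoming sentence 1", "incoming sentence 2", "incoming sentence 3"]
embeddings = model.encode(new_sentences)
cluster_ids = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
print(cluster_ids)
```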

vdabravolski commented 2 years ago

@nreimers I have a quick question for you.

I'm planning to use the MPNet baseline model to train an LM on my domain. Since I don't have a labeled domain dataset of sentence pairs, I'm looking to use unsupervised training like SimCSE. Is that something you can recommend?

nreimers commented 2 years ago

@vdabravolski The SimCSE approach does not work at all for learning domain-specific things.

Have a look at: https://arxiv.org/abs/2104.06979

In principle, pre-trained models that have been trained on diverse domains & tasks (like the all-* models) are hard to beat, and unsupervised approaches are far inferior to them.

vdabravolski commented 2 years ago

@nreimers thanks as always.

So, based on the linked paper, it looks like I should try TSDAE.

@nreimers do you happen to have a TSDAE implementation for HF models?

nreimers commented 2 years ago

https://www.sbert.net/examples/unsupervised_learning/TSDAE/README.html
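
For completeness, a condensed sketch along the lines of that README is shown below. The README's example uses bert-base-uncased as the base (whether another checkpoint such as MPNet is supported as the tied decoder may need checking); the sentences and output path are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, datasets, losses, models

# Build a SentenceTransformer from a plain Huggingface checkpoint
model_name = "bert-base-uncased"
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Unlabeled, domain-specific sentences
train_sentences = [
    "First domain-specific sentence.",
    "Second domain-specific sentence.",
    "Third domain-specific sentence.",
]

# The dataset adds noise (token deletion) to each sentence; the loss trains an
# encoder-decoder to reconstruct the original sentence from the noisy input
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path=model_name, tie_encoder_decoder=True
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)
model.save("tsdae-domain-model")
```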

vdabravolski commented 2 years ago

Closing as questions were addressed. Thanks!

joaocp98662 commented 2 years ago

Hi @pritamdeka! I'm doing something very similar to what you did. Did you write an article or anything about your work? I'd love to get some insights into the collections you used, the models you fine-tuned, the process, and your results. I'm using the TREC Clinical Trials track collection, but the collection is small and, as you've said, there is not much data available in the clinical domain. I'd really appreciate it if you could give me some insights/advice on how you did it. Thank you very much!

pritamdeka commented 2 years ago

Hi @joaocp98662, thanks for reaching out. If I am understanding correctly, are you trying to generate better sentence embeddings by fine-tuning over the TREC Clinical Trials track, or do you want to use already existing biomedical sentence embedding models? If you want to first try existing models, then I would suggest the models available in my HF repo, which I am linking here. If you still want to fine-tune over your dataset, then I would suggest following the S-BERT documentation, but first feel free to try the existing models and let me know how good or bad the results are. Feel free to ask if you have more questions. Thanks.

https://huggingface.co/pritamdeka/S-Scibert-snli-multinli-stsb

https://huggingface.co/pritamdeka/S-BioBert-snli-multinli-stsb

https://huggingface.co/pritamdeka/S-Biomed-Roberta-snli-multinli-stsb

Feel free to try out these models for biomedical sentence embeddings.

joaocp98662 commented 2 years ago

Hi @pritamdeka, thank you for your reply and help. I'm doing dense retrieval using an SBERT model trained on MS MARCO, and I'm trying to improve my results by fine-tuning that model on my domain, which is clinical trials. From my dataset I have the corpus (clinical trials), the queries (patient information), and the relevance judgments (query-document pairs with grades: 0 - not relevant, 1 - excluded, 2 - eligible). I'm following the SBERT training overview, but I'm struggling with how to implement it with my data.

pritamdeka commented 2 years ago

Hi @joaocp98662, since you are using SBERT models trained on the MS MARCO dataset, I would suggest following their training script, which caters to this. Also, when you want to train on your own data, make sure the data is in the same format as the MS MARCO dataset; this will ensure there are no problems when you run the script. You may also have to change the code in some places to accommodate your data. That's usually how I do the training on my own data.
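
One way to frame that data for the MS MARCO-style scripts is as (query, relevant document) pairs trained with MultipleNegativesRankingLoss, the loss used in those examples. The sketch below is only illustrative (checkpoint name, pairs, and hyperparameters are placeholders), not the exact training script:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Continue from an MS MARCO-trained checkpoint (name is illustrative)
model = SentenceTransformer("msmarco-distilbert-base-v4")

# Build (query, relevant document) pairs from the relevance judgments,
# e.g. keeping only grade-2 ("eligible") trials as positives
train_examples = [
    InputExample(texts=["patient description ...", "text of an eligible clinical trial ..."]),
    InputExample(texts=["another patient description ...", "text of another eligible trial ..."]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Other documents in the same batch act as negatives for each query
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("msmarco-distilbert-clinical-trials")
```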