Closed: vdabravolski closed this issue 2 years ago.
This is very similar to what I did a few days back. I fine-tuned a few of the available bio models in the Hugging Face repo. My aim was to fine-tune the models for the sentence-similarity task on biomedical texts. The problem is that for biomedical texts there is not much data available for the STS task. Normally you can follow the fine-tuning approach provided in the SBERT documentation: you just need to run the Python script for the fine-tuning task and pass in the Hugging Face repo name of whichever model you need to fine-tune.
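For reference, a minimal sketch of that fine-tuning flow with the classic sentence-transformers API; the checkpoint name, example pair, and hyperparameters here are illustrative placeholders, not from the original comment:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Any Hugging Face checkpoint can be wrapped like this; BioBERT is just an example.
word_emb = models.Transformer("dmis-lab/biobert-base-cased-v1.1", max_seq_length=256)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())  # mean pooling
model = SentenceTransformer(modules=[word_emb, pooling])

# STS-style training pairs with a similarity label in [0, 1]
train_examples = [
    InputExample(texts=["The patient has hypertension.",
                        "High blood pressure was observed."], label=0.9),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
```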
Yes, you can first train with MLM on your data: https://www.sbert.net/examples/unsupervised_learning/MLM/README.html
And then use the datasets from here and train your custom model: https://www.sbert.net/examples/training/paraphrases/README.html
Note: NLI data is rather narrow, so the embeddings you can derive from it are not the best. It is better to train broadly on the datasets from the paraphrase page; that yields much better embeddings.
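The linked MLM README ships a ready-made script; roughly equivalent plain Hugging Face code looks like the sketch below (file paths, model name, and hyperparameters are placeholders):

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # any checkpoint you want to domain-adapt
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One domain sentence per line in a plain-text file
dataset = load_dataset("text", data_files={"train": "domain_sentences.txt"})["train"]
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
                      batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="output/mlm-adapted",
                           num_train_epochs=1, per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
trainer.save_model("output/mlm-adapted")
tokenizer.save_pretrained("output/mlm-adapted")
```

The adapted checkpoint can then be wrapped in a SentenceTransformer and trained on the paraphrase datasets as described above.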
Thanks @nreimers, precisely what I was looking for.
Sorry for the really naive question, but I'm wondering if the MLM code spits out a Hugging Face model that can then be loaded with their native code? The reason is that I want to train an MLM model with 'prompting' and then initialise it with a 'Zero Shot Classification' head/class of theirs. Is that possible?
Yes, it returns a standard Hugging Face Transformers model.
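Since the output is a standard Transformers checkpoint, it reloads with the native API. Note that the zero-shot classification pipeline expects an NLI-fine-tuned head, so after MLM you would still need NLI training before zero-shot works; the path and label count below are placeholders:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("output/mlm-adapted")
# Loads the MLM-adapted encoder with a fresh (randomly initialized) classification
# head, e.g. 3 labels for entailment/neutral/contradiction NLI fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(
    "output/mlm-adapted", num_labels=3)
```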
Thanks @nreimers for the reply.
I have a similar problem to @vdabravolski, with a domain-specific unlabeled corpus. What do you think about starting from an SBERT model already fine-tuned on paraphrases (e.g. paraphrase-distilroberta-base-v1) and fine-tuning it again on domain-specific data using the TSDAE approach shown here (https://www.sbert.net/examples/unsupervised_learning/TSDAE/README.html)?
What would be the effect of the domain-specific corpus compared to the approach of @vdabravolski in the first comment?
Thanks!
@federicoBetti This sadly does not work well in our experiments.
Hi @nreimers
In my somewhat unusual case, I have access to a stream of labeled, domain-specific sentence pairs, and would ideally like to pretrain and fine-tune on this data before performing inference on a new batch of sentences. Would you expect a continuous training and inference setup like the one below to yield good clusters?
BERT model -> MLM(domain-specific sentences) -> CosineSimilarityLoss(domain-specific pairs) -> cluster new sentences -> MLM(domain-specific sentences) -> CosineSimilarityLoss(domain-specific pairs) -> cluster new sentences -> ...
I'm concerned that this scheme won't work well based on your last comment. So is there an efficient way to make use of new, labeled data?
Thanks!
Hi @kddubey
I would run: BERT -> MLM (domain data) -> loss on pairs
When new data then arrives, I would only continue training on the loss with your pairs, skipping MLM.
What you could do is periodic pre-training: e.g., instead of doing MLM each time new data arrives, you first collect a large collection of data, then do MLM, then do supervised training on all available pairs.
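A sketch of the "continue training on new pairs only" step this suggests; the model path, batch size, and epochs are assumed placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Model that was already MLM-adapted and trained on the initial pairs
model = SentenceTransformer("output/domain-sbert")

def continue_training(new_pairs):
    """new_pairs: list of (sentence_a, sentence_b, similarity in [0, 1]) tuples."""
    examples = [InputExample(texts=[a, b], label=score) for a, b, score in new_pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=16)
    loss = losses.CosineSimilarityLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
    model.save("output/domain-sbert")  # checkpoint for the next round
```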
@nreimers I have a quick question for you.
I'm planning to use the MPNet base model to train an LM on my domain. Since I don't have a labeled domain dataset of sentence pairs, I'm looking to use unsupervised training like SimCSE. Is this something you can recommend?
@vdabravolski The SimCSE approach does not work at all for learning domain-specific things.
Have a look at: https://arxiv.org/abs/2104.06979
In principle, pre-trained models that have been trained on diverse domains & tasks (like the all-* models) are hard to beat, and unsupervised approaches are far inferior to them.
@nreimers thanks as always.
So, based on the linked paper, it looks like I should try:
@nreimers do you happen to have a TSDAE implementation for HF models?
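For reference, the TSDAE recipe from the SBERT docs looks roughly like the sketch below. The docs example uses bert-base-uncased, and the sentence list is a placeholder for your unlabeled domain corpus:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import DenoisingAutoEncoderDataset

train_sentences = ["..."]  # your unlabeled, domain-specific sentences

model_name = "bert-base-uncased"
word_emb = models.Transformer(model_name)
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), "cls")  # TSDAE uses CLS pooling
model = SentenceTransformer(modules=[word_emb, pooling])

# The dataset adds noise (token deletion) to each sentence; the loss trains the
# encoder to reconstruct the original sentence through a tied decoder.
train_dataset = DenoisingAutoEncoderDataset(train_sentences)
loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name,
                                       tie_encoder_decoder=True)

model.fit(train_objectives=[(loader, loss)], epochs=1, weight_decay=0,
          scheduler="constantlr", optimizer_params={"lr": 3e-5},
          show_progress_bar=True)
```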
Closing as questions were addressed. Thanks!
Hi @pritamdeka! I'm doing something very similar to what you did. Did you write an article or anything about your work? I'd love to have some insights into the collections you used, the models you fine-tuned, the process, and your results. I'm using the TREC Clinical Trials track collection, but the collection is small and, as you've said, there is not much data available in the clinical domain. I would really appreciate it if you could give me some insights/advice on how you did it. Thank you very much!
Hi @joaocp98662 Thanks for reaching out. If I am understanding correctly, you are trying to generate better sentence embeddings by fine-tuning on the TREC Clinical Trials track? Or do you want to use already-existing biomedical sentence embedding models? If you want to try existing models first, I would suggest the ones available in my HF repo, which I am linking here. If you still want to fine-tune on your dataset, I would suggest following the S-BERT documentation; but before that, feel free to try the existing models and let me know how good or bad the results are. Feel free to ask if you have more questions. Thanks.
https://huggingface.co/pritamdeka/S-Scibert-snli-multinli-stsb
https://huggingface.co/pritamdeka/S-BioBert-snli-multinli-stsb
https://huggingface.co/pritamdeka/S-Biomed-Roberta-snli-multinli-stsb
Feel free to try out these models for biomedical sentence embeddings.
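A quick usage sketch for any of the models above; the sentences are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pritamdeka/S-BioBert-snli-multinli-stsb")
embeddings = model.encode(["The tumor was benign.",
                           "No malignancy was detected."])
# Cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))
```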
Hi @pritamdeka Thank you for your reply and help. I'm doing dense retrieval using an SBERT model trained on MS MARCO. I'm trying to improve my results by fine-tuning the model on my domain, which is clinical trials. From my dataset I have the corpus (clinical trials), the queries (patient information), and the relevance judgments (query-document pairs with grades: 0 = not relevant, 1 = excluded, 2 = eligible). I'm following the SBERT training overview but I'm struggling with how to implement it with my data.
Hi @joaocp98662 Since you are using SBERT models trained on the MS MARCO dataset, I would suggest following their training script, which caters to this. Also, when you want to train on your data, make sure it is in the same format as the MS MARCO dataset; this will ensure no problems when you run the script. You may also have to change the code in some places to accommodate your data. That's usually how I do the training on my own data.
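One way this could look for graded clinical-trials judgments, sketched with the loss the MS MARCO bi-encoder scripts use; the starting checkpoint, the toy data, and the grade mapping are assumptions to adapt to your setup:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("msmarco-distilbert-base-v3")  # your starting model

# Map grades to (query, positive, negative) triplets:
# grade 2 (eligible) -> positive document, grade 0 (not relevant) -> negative.
triplets = [
    ("65-year-old male with type 2 diabetes ...",       # patient description (query)
     "Trial A: adults with T2DM, HbA1c above 7% ...",   # grade 2 trial
     "Trial B: pediatric asthma study ..."),            # grade 0 trial
]
train_examples = [InputExample(texts=[q, pos, neg]) for q, pos, neg in triplets]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # used by the MS MARCO scripts
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```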
Hi,
I'd like to use the SBERT model architecture for document similarity and topic modelling tasks. However, my data corpus is fairly domain-specific, and I suspect that SBERT will underperform since it was trained on generic Wikipedia/library corpora. So, I wonder if there are any recommendations around fine-tuning the underlying language model for SBERT.
I envision that the overall process will be as follows:
Curious to hear thoughts on the approach and problem definition.