UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

How to continue the pretraining of SBERT models using MLM? #1502

Open KrishnanJothi opened 2 years ago

KrishnanJothi commented 2 years ago

How to continue the pretraining of Sentence-BERT models using MLM? Is there any documentation or a code snippet for this purpose? I would like to continue pretraining the "all-MiniLM-L6-v2" model on domain-specific unsupervised data, and then evaluate the model on sentence pairs to obtain Semantic Textual Similarity scores.

cm2435 commented 2 years ago

Hi Krishnan,

AFAIK, all of the sentence-transformers models are also hosted on Hugging Face, so you can take your model and do something like

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

Once the model object is loaded, follow the normal MLM training procedure using something like https://huggingface.co/course/chapter7/6?fw=tf.
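
For illustration, a minimal sketch of what that continued-MLM step could look like with the Hugging Face Trainer; domain_corpus.txt (one sentence per line), the output directory, and the hyperparameters are placeholders, not values from this thread:

# Continued MLM pretraining sketch with Hugging Face Transformers.
# "domain_corpus.txt" is a hypothetical file of domain sentences, one per line.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)  # attaches a freshly initialized MLM head

dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)

# Randomly masks 15% of tokens on the fly for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="minilm-domain-mlm",
                           per_device_train_batch_size=32,
                           num_train_epochs=1,
                           learning_rate=1e-4),
    train_dataset=dataset["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("minilm-domain-mlm")
tokenizer.save_pretrained("minilm-domain-mlm")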

That being said, I'd imagine you'd likely be better served by something like TSDAE: https://www.sbert.net/examples/unsupervised_learning/README.html#tsdae

The paper for TSDAE is https://arxiv.org/abs/2104.06979

Which, if I recall correctly, shows that it outperforms MLM for unsupervised domain adaptation in cases such as yours.
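
A minimal TSDAE sketch along the lines of the sbert.net example linked above; the sentence list here is just a placeholder for your own domain data:

# TSDAE training sketch with sentence-transformers.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

model_name = "sentence-transformers/all-MiniLM-L6-v2"
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_sentences = ["a domain sentence ...", "another domain sentence ..."]  # placeholder data
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)  # adds noise via token deletion
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# The tied decoder reconstructs the original sentence from the pooled embedding
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name,
                                             tie_encoder_decoder=True)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)
model.save("minilm-tsdae")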

nreimers commented 2 years ago

As shown in https://arxiv.org/abs/2104.06979 and https://arxiv.org/abs/2112.07577

Running MLM afterwards (i.e., on top of an already-trained sentence embedding model) will destroy the model. You would first need to run MLM, then train on labeled / paired data.

What you could try is to use GPL for domain adaptation: https://arxiv.org/abs/2112.07577
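
To make the two-step recipe concrete, here is a rough sketch of the second step (training on paired data after MLM) with sentence-transformers; the checkpoint directory "minilm-domain-mlm" and the example pairs are hypothetical placeholders:

# Fine-tuning a sentence embedding model on top of an MLM-adapted checkpoint.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# Build the SentenceTransformer from the domain-adapted transformer weights
word_embedding_model = models.Transformer("minilm-domain-mlm", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Placeholder (anchor, positive) pairs; in practice these come from your labeled data
train_examples = [
    InputExample(texts=["query about topic A", "passage answering topic A"]),
    InputExample(texts=["query about topic B", "passage answering topic B"]),
]
train_dataloader = DataLoader(train_examples, batch_size=16, shuffle=True)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives over the pairs

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("minilm-domain-sbert")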

KrishnanJothi commented 2 years ago

Thank you Charlie and Nils.

I have a few doubts about Nils' comment. Please correct me if my following understanding is not right:

  1. Running MLM on the pretrained model will degrade it. Hence I should first run MLM, then train on labeled / paired data.
  2. Instead of MLM, unsupervised methods like TSDAE and GPL give better performance.

I would like to give more information about my work:

  1. In my case, I will be computing the similarity between two words (not two sentences).
  2. To be specific, this is a data-processing step to generate weak labels (using some query words) for an NER task.

Any suggestions are welcome.

nreimers commented 2 years ago

Using transformer models for words doesn't make sense. Use word2vec here. It only makes sense to use transformers if you have longer text (longer than words and phrases).

KrishnanJothi commented 2 years ago

Yeah, I thought the same and used the spaCy "en_core_web_sm" model. But the results of the SBERT models are more meaningful for my dataset, even for word similarity!
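
For context, word-level similarity with an SBERT model can be computed along these lines; the word list is purely illustrative:

# Cosine similarity between word embeddings from an SBERT model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
words = ["fever", "temperature", "fracture"]  # illustrative words
embeddings = model.encode(words, convert_to_tensor=True)

# Pairwise cosine similarities between the word embeddings
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)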

nreimers commented 2 years ago

Maybe try gensim word2vec then
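
For completeness, a minimal gensim word2vec sketch for that suggestion; the corpus file and the example words are placeholders:

# Training word2vec on a domain corpus with gensim (4.x API).
# "domain_corpus.txt" is a hypothetical file with one whitespace-tokenized sentence per line.
from gensim.models import Word2Vec

sentences = [line.split() for line in open("domain_corpus.txt", encoding="utf-8")]
model = Word2Vec(sentences, vector_size=300, window=5, min_count=2, workers=4, epochs=10)

# Cosine similarity between two in-vocabulary words, and nearest neighbours of a word
print(model.wv.similarity("fever", "temperature"))
print(model.wv.most_similar("fever", topn=5))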