I'm training a SentenceTransformer on top of an existing Spanish MLM model (bert-base-spanish-wwm-uncased), using a smallish labelled dataset. So far it works pretty well, but I'm trying to push it a bit further.
Hoping you can give me some pointers on two points:
Since I'm working in a specific domain that bert-base-spanish-wwm-uncased was NOT trained on, and I have a big unlabelled dataset, would it be advisable to fine-tune the MLM model on that data before training the SentenceTransformer? Is there a "SentenceTransformer way" of doing this, or should I just follow a generic language-model fine-tuning approach, e.g. using SimpleTransformers?
My vocab has a lot of [UNUSED] entries, and my big unlabelled dataset has a lot of OOV words. I know these are not a huge deal for BERT models, but I'm still curious whether having them in the vocab would improve performance. If I simply add new words into the [UNUSED] vocab entries, will the SentenceTransformer training learn to use them?
Yes, running MLM can make sense; you can use the standard MLM pre-training. We aim to release a new pre-training method specifically for training sentence embeddings soon.
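To make this concrete, here is a minimal sketch of domain-adaptive MLM with plain Hugging Face Transformers (there is no SentenceTransformers-specific recipe implied here; the hub name, file path, and hyperparameters are placeholders, not tested values):

```python
# Rough sketch of domain-adaptive MLM with Hugging Face Transformers.
# Model name, file path and hyperparameters are placeholders; adjust to your setup.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

model_name = "dccuchile/bert-base-spanish-wwm-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One sentence (or short paragraph) per line from the unlabelled domain corpus.
train_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="unlabelled_domain_corpus.txt",
    block_size=128,
)

# Standard BERT-style masking: 15% of tokens are masked.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="bert-spanish-domain-mlm",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
).train()

model.save_pretrained("bert-spanish-domain-mlm")
tokenizer.save_pretrained("bert-spanish-domain-mlm")
```

The adapted checkpoint can then be used like any other transformer model in SentenceTransformers, e.g. `models.Transformer("bert-spanish-domain-mlm")` followed by a `models.Pooling` layer, and fine-tuned on the labelled data as before.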
Yes, extending your vocab with the most common OOV tokens can help when you then run MLM.
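A rough sketch of how the vocab extension could look; note that `tokenizer.add_tokens` appends new entries rather than overwriting the [unused] slots, but the effect is the same: the new embedding rows start out random and only become useful during the MLM run. The word list below is purely illustrative.

```python
# Sketch: extend the tokenizer with frequent domain words before the MLM run.
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "dccuchile/bert-base-spanish-wwm-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Illustrative list; in practice, mine the unlabelled corpus for frequent words
# that the current tokenizer splits into many sub-word pieces.
domain_words = ["palabra_de_dominio_1", "palabra_de_dominio_2"]

num_added = tokenizer.add_tokens(domain_words)
model.resize_token_embeddings(len(tokenizer))  # adds randomly initialised rows
print(f"Added {num_added} new tokens to the vocabulary")
```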