UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Pre-train SentenceEmbedder using TSDAE from scratch, i.e. not using an existing model like "bert-base-uncased" #2529

Open Jakobhenningjensen opened 8 months ago

Jakobhenningjensen commented 8 months ago

I might be misunderstanding the example in the docs, but I read it as showing how we can finetune an existing model using TSDAE (I'm assuming that since we initialize the model from "bert-base-uncased" - or is that just to define the architecture of the model and not the weights?).

In the paper you suggest that TSDAE is a great way to pre-train and then fine-tune on labeled data, and that is what I'm trying to do.

My question is: is there a way to train a non-existing model with the TSDAE loss using SentenceTransformer, i.e. create one "from scratch", or do I have to implement it myself? Or doesn't it matter that the model weights, e.g. from "bert-base-uncased", are not randomly initialized when training TSDAE? I would assume that adapting an existing model would require more training time than training a "fresh" one, since it would have to "forget" its old domain and learn the new one.

tomaarsen commented 8 months ago

I might be misunderstanding the example in the docs, but I read it as showing how we can finetune an existing model using TSDAE (I'm assuming that since we initialize the model from "bert-base-uncased" - or is that just to define the architecture of the model and not the weights?).

The various examples are indeed for finetuning an already pre-trained BERT-base model into an embedding model.

The exception is TSDAE as Pre-Training Task, although that doesn't have a code example. However, I suspect that you can adapt any of the other snippets, use a larger dataset, and overwrite the loaded model weights with randomized values. Afterwards, you can finetune with labeled data to train an even better model.

My question is: is there a way to train a non-existing model with the TSDAE loss using SentenceTransformer, i.e. create one "from scratch", or do I have to implement it myself? Or doesn't it matter that the model weights, e.g. from "bert-base-uncased", are not randomly initialized when training TSDAE? I would assume that adapting an existing model would require more training time than training a "fresh" one, since it would have to "forget" its old domain and learn the new one.

I think that should be possible. You can probably load a fresh model like so:

from transformers import BertModel
from sentence_transformers import SentenceTransformer

# Load the pretrained model to get the architecture & pooling setup ...
model = SentenceTransformer("bert-base-uncased")
# ... then swap in a freshly initialized BertModel built from the same config,
# so only the architecture is kept, not the pretrained weights
model[0].auto_model = BertModel(model[0].auto_model.config)

(and similar for other architectures).
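Putting the pieces together, here's a rough, untested sketch of TSDAE pre-training on that freshly re-initialized model, adapted from the TSDAE documentation example; the sentence list, batch size and learning rate are just placeholders:

from torch.utils.data import DataLoader
from transformers import BertModel
from sentence_transformers import SentenceTransformer, datasets, losses

# Keep the bert-base-uncased architecture, but reset the weights (as above)
model = SentenceTransformer("bert-base-uncased")
model[0].auto_model = BertModel(model[0].auto_model.config)

# Unlabeled in-domain sentences (placeholder; use a large corpus in practice)
train_sentences = ["A first unlabeled sentence.", "A second unlabeled sentence."]

# DenoisingAutoEncoderDataset adds the denoising noise on the fly
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# TSDAE loss: an encoder-decoder setup that reconstructs the original sentence
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path="bert-base-uncased", tie_encoder_decoder=True
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    weight_decay=0,
    show_progress_bar=True,
)
model.save("output/tsdae-from-scratch")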

However, I suspect that adapting an existing model will require less training & give better results than training one from scratch. Consider that the pretrained model already has a grasp of your language of interest, while a fresh model understands nothing at all. Pretraining a model all the way from scratch will require a lot of training data & time.

Perhaps @kwang2049 can add a bit more information and/or correct me.

Jakobhenningjensen commented 8 months ago

@tomaarsen thanks for the swift reply! Agreed that the base model would already have some knowledge of the language, which could have a positive impact.

I wonder if we can reuse the data used for TSDAE in the labeled data. Say we have movie reviews and some score (1-5): could we use the review text for TSDAE and then use the labels for fine-tuning, or would that lead to some kind of overfitting, i.e. do we have to split the data into two? I would assume it could work, but I still wonder if you have experience with it.

Jakobhenningjensen commented 7 months ago

@tomaarsen Just to make sure I've got it right: we would use TSDAE to pre-train an already trained model, right?

I.e:

  1. Load a pre-trained BERT/SentenceEmbedder
  2. Train using TSDAE
  3. Fine-tune on some labeled data, e.g. using MNRL

tomaarsen commented 7 months ago

@Jakobhenningjensen That is exactly right. See e.g. https://sbert.net/examples/unsupervised_learning/TSDAE/README.html#tsdae-as-pre-training-task

However, most users load a pretrained BERT or Sentence Transformer model and directly finetune it on some labeled data with e.g. MNRL. This often already results in excellent performance.
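For completeness, a minimal sketch of that common route with MultipleNegativesRankingLoss, using the standard InputExample/fit() API; the model name and the pairs below are placeholders:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from a pretrained embedding model (or a TSDAE-pretrained one)
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Pairs of related texts; MNRL uses the other in-batch examples as negatives
train_examples = [
    InputExample(texts=["How do I reset my password?", "Steps to change your account password"]),
    InputExample(texts=["Great movie, loved the acting", "One of the best films I have seen"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    show_progress_bar=True,
)
model.save("output/mnrl-finetuned")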

Jakobhenningjensen commented 7 months ago

That is exactly where I'm a bit confused, which is also the reason for the original question above: is TSDAE used for (a) an already trained model or (b) a "new model"?

My understanding (which also seems to match the paper) was (b), i.e. that we train a new model using TSDAE and then fine-tune it on labeled data. But the examples in the documentation use (a), i.e. they fine-tune an already trained model.

Or do we (in the real world) most often just pick an already trained model and then just fine-tune it with labeled data, i.e. TSDAE isn't used that often (or only when we want to train a general embedder, i.e. not for downstream tasks like classification etc.)?

I'm sorry about any confusion and really appreciate your time!

tomaarsen commented 7 months ago

Apologies, I was indeed a bit vague here, because there are a few different 'kinds' of models. There are a handful of options:

  1. TSDAE as pre-training from scratch, i.e. from a model with randomly initialized weights.
  2. TSDAE as continued pre-training from a foundational model such as BERT, RoBERTa, MPNet, MiniLM, etc. These are models mostly trained using Masked Language Modeling, and are meant to be finetuned further for actual tasks.
  3. Take any existing embedding model, e.g. from https://huggingface.co/models?library=sentence-transformers&sort=trending

All 3 of these approaches result in an "embedding model", i.e. a model whose embeddings should be fairly meaningful (for clustering, classification, retrieval, etc.). With such an embedding model, you can finetune it further for your use cases, which is often done with labeled data.

Most people opt for the third, as it is certainly the easiest. So yes, most users pick an already trained embedding model and just finetune it with labeled data. TSDAE is indeed not used very often.

As for the TSDAE paper, it considers three setups: using TSDAE purely as an unsupervised sentence embedding method, using it for domain adaptation, and using it as a pre-training step before supervised fine-tuning.

As for my advice:

  1. Start with a solid embedding model. E.g. https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1, https://huggingface.co/BAAI/bge-m3, https://huggingface.co/sentence-transformers/all-mpnet-base-v2. Evaluate (formally or just "by feels") how well this works.
  2. If not satisfied with the results, finetune this model by collecting some labeled data. Solid options are MultipleNegativesRankingLoss if you end up with pairs of similar texts, or CoSENTLoss or AnglELoss if you have pairs with a similarity score (see the sketch after this list). There's documentation to guide you a bit. Not all existing embedding models finetune further as nicely, so sometimes it might even work better to finetune a "foundational model" instead, e.g. https://huggingface.co/microsoft/mpnet-base
  3. If still not satisfied with the results, consider TSDAE on unlabeled data in your domain and then finetune it with your labeled data.
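And a quick sketch of option 2 for the case where your pairs carry a similarity score (scores normalized to [0, 1]; the model name and data below are made up):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Pairs with a similarity score in [0, 1] (placeholder data)
train_examples = [
    InputExample(texts=["The film was fantastic", "I really enjoyed this movie"], label=0.9),
    InputExample(texts=["The film was fantastic", "The plot made no sense"], label=0.2),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CoSENTLoss(model)  # or losses.AnglELoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)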