Jakobhenningjensen opened this issue 8 months ago
I might misunderstand the example in the docs, but I read it as showing how we can finetune an existing model using TSDAE (I'm assuming that since we initialize the model as "bert-base-uncased" - or is that just to define the architecture of the model and not the weights?).
The various examples are indeed for finetuning an already pre-trained BERT-base model into an embedding model.
The exception is TSDAE as Pre-Training Task, although that doesn't have a code example. However, I suspect that you can adapt any of the other snippets, use a larger dataset, and overwrite the loaded model weights with randomized values. Afterwards, you can then finetune with labeled data to train an even better model.
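For reference, the unsupervised TSDAE snippet from the docs looks roughly like this (a sketch only; train_sentences is a placeholder for your own unlabeled corpus). For the pre-training variant you would combine it with a larger dataset and the weight re-initialization discussed below:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

# Placeholder: replace with your own (large) unlabeled corpus
train_sentences = ["Your first sentence", "Your second sentence", "..."]

# Encoder: a transformer with CLS pooling, as in the TSDAE examples
model_name = "bert-base-uncased"
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# The dataset adds noise (token deletion) on the fly, yielding (damaged, original) sentence pairs
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# The TSDAE loss attaches a decoder that reconstructs the original sentence from the embedding
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)
```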
My question is: is there a way to train a brand-new model with the TSDAE loss using SentenceTransformer, i.e. create one "from scratch", or do I have to implement it myself? Or doesn't it matter that the model weights, e.g. from "bert-base-uncased", are not randomly initialized when training with TSDAE? I would assume that adapting an existing model would require more training time than training a "fresh" one, since it would have to "forget" its old domain and learn the new one.
I think that should be possible. You can probably load a fresh model like so:
```python
from transformers import BertModel
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-uncased")
# Replace the loaded backbone with a BertModel built from the same config,
# i.e. identical architecture but randomly initialized weights
model[0].auto_model = BertModel(model[0].auto_model.config)
```
(and similar for other architectures).
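For other architectures, the same idea should work via AutoModel.from_config, which builds a randomly initialized backbone from the config alone (a small sketch; "roberta-base" is just an example checkpoint):

```python
from transformers import AutoModel
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("roberta-base")  # example checkpoint; any transformers backbone should do
# from_config keeps the architecture but discards the pretrained weights
model[0].auto_model = AutoModel.from_config(model[0].auto_model.config)
```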
However, I suspect that adapting an existing model will require less training & give better results than training one from scratch. Consider that the pretrained model already has a grasp of your language of interest, while a fresh model understands nothing at all. For pretraining a model all the way from scratch you will require a lot of training data & time.
Perhaps @kwang2049 can add a bit more information and/or correct me.
@tomaarsen thanks for the swift reply! Agreed that the base model would have some base knowledge which could have a positive impact.
I wonder if we can use the same data for TSDAE and for the labeled fine-tuning, i.e. say we have movie reviews and some score (1-5). Could we use the review text for TSDAE and then use the labels for fine-tuning, or would that lead to some kind of overfitting, i.e. do we have to split the data into two? I would assume it could work, but I still wonder if you have experience with it.
@tomaarsen Just to make sure I understand correctly: we would use TSDAE to pre-train an already trained model, right?
I.e.:
1) Load a pre-trained BERT/sentence embedder
2) Train it using TSDAE
3) Fine-tune it on some labeled data, e.g. using MNRL
@Jakobhenningjensen That is exactly right. See e.g. https://sbert.net/examples/unsupervised_learning/TSDAE/README.html#tsdae-as-pre-training-task
However, most users load a pretrained BERT or Sentence Transformer model and directly finetune it on some labeled data with e.g. MNRL. This often already results in excellent performance.
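To illustrate that common path, here is a minimal MNRL finetuning sketch (the model name and the (anchor, positive) pairs are only placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder pairs of texts that should end up close together in the embedding space
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating a piece of bread."]),
    InputExample(texts=["The weather is lovely today.", "It is sunny outside."]),
]

# An already trained embedding model (placeholder choice)
model = SentenceTransformer("all-MiniLM-L6-v2")

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# MNRL uses the other positives in the batch as in-batch negatives
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```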
That is exactly where I'm a bit confused, which is also the reason for the original question above: is TSDAE used for (a) an already trained model or (b) a "new model"?
My understanding (which also seems to be what the paper does) was (b), i.e. that we train a new model using TSDAE and then fine-tune it on labeled data. But the examples in the documentation use (a), i.e. they fine-tune an already trained model.
Or do we (in the real world) most often just pick an already trained model and simply fine-tune it with labeled data, i.e. TSDAE isn't used that often (or only when we want to train a general embedder, i.e. not for downstream tasks like classification etc.)?
I'm sorry about any confusion and really appreciate your time!
Apologies, I was indeed a bit vague here, because there are a few different 'kinds' of models. There are a handful of different options:
1) Pre-train a randomly initialized transformer with TSDAE on unlabeled in-domain text.
2) Take an already pretrained transformer (e.g. "bert-base-uncased") and train it with TSDAE on unlabeled text.
3) Take an already trained embedding model (i.e. an existing Sentence Transformer model).
All 3 of these approaches result in an "embedding model", i.e. a model whose embeddings should be fairly meaningful (for clustering, classification, retrieval, etc.). With such an embedding model, you can finetune it further for your use cases, which is often done with labeled data.
Most people opt for the third, as it is certainly the easiest. So yes, most users pick an already trained embedding model and just finetune it with labeled data. TSDAE is indeed not used very often.
As for the TSDAE paper, they mention three setups:
1) TSDAE as an unsupervised sentence embedding method (only unlabeled data).
2) TSDAE for domain adaptation.
3) TSDAE as a pre-training method, followed by supervised fine-tuning.
As for my advice:
In the TSDAE paper you suggest that it is a great way to pre-train a model and then fine-tune it on labeled data, and that is what I'm trying to do.