UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Fine-tuning a pre-trained model for classification #492

Open aliosia opened 3 years ago

aliosia commented 3 years ago

Hi, thanks a lot for the great SBERT. I wanted to add a softmax layer on top of one of the pre-trained models and build a classifier, but I saw this and thought there might be no option for updating the weights of the pre-trained model; is this true?

If it is possible, I wrote a customized Dataset class and called model.tokenize() in it, just like SentencesDataset. But when I build a dataset and pass it to a DataLoader, I get the following error:
RuntimeError: stack expects each tensor to be equal size, but got [295] at entry 0 and [954] at entry 1
I wonder whether I should call prepare_for_model after calling the tokenize method, or do something else?
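
For reference, this error comes from PyTorch's default collate function trying to torch.stack token-id tensors of different lengths (295 vs. 954 here). One way around it is to let the library pad each batch itself. A minimal sketch, assuming the InputExample / smart_batching_collate API, with a placeholder model name and toy data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample

# Placeholder model name; any SentenceTransformer model works the same way.
model = SentenceTransformer("bert-base-nli-mean-tokens")

train_examples = [
    InputExample(texts=["a short sentence"], label=0),
    InputExample(texts=["a much longer sentence that tokenizes to many more wordpiece ids"], label=1),
]

# The library's own collate function pads every batch to a common length,
# which avoids "stack expects each tensor to be equal size" raised by the
# default collate on variable-length token-id tensors.
train_dataloader = DataLoader(
    train_examples,
    batch_size=2,
    shuffle=True,
    collate_fn=model.smart_batching_collate,
)

features, labels = next(iter(train_dataloader))
```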

Thanks in advance.

nreimers commented 3 years ago

Hi @aliosia You usually get much better results if you use Transformers directly and fine-tune it on your sentiment classification task.

I don't know who brought this idea up in the community, but it was never a good idea to first map a sentence to an embedding and then use that embedding as the (only) feature for a classifier like logistic regression. Classifiers working directly on the text data have always outperformed these sentence embedding -> classifier constructions.

So for your case I recommend fine-tuning directly for classification and not using a sentence embedding in between.
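
A minimal sketch of that direct fine-tuning route with Hugging Face Transformers, assuming a small binary sentiment dataset; the model name, data, and hyperparameters below are placeholders, not part of this thread:

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

class ClassificationDataset(Dataset):
    """Tokenizes raw texts once and serves (input_ids, attention_mask, label) items."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_dataset = ClassificationDataset(["good movie", "terrible plot"], [1, 0])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=train_dataset,
)
trainer.train()  # fine-tunes all weights end-to-end on the classification loss
```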

aliosia commented 3 years ago

Thanks a lot for your explanation @nreimers. I will surely test the other way more, but in my first try I got better results with SBERT features.

Also, the idea of first training with Siamese networks (contrastive loss or triplet loss) in an unsupervised way and then fine-tuning with a logistic loss for classification is not new; I remember that for nearly two years (around 2015) the state-of-the-art face classification models used both loss functions together. Hence, I think starting from a pre-trained network and fine-tuning with a classification loss seems reasonable.
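
For comparison, a sketch of the "frozen sentence embedding -> classifier" construction being discussed, with scikit-learn's logistic regression on top of SBERT features; the model name and data are placeholders:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("bert-base-nli-mean-tokens")  # placeholder model

train_texts = ["good movie", "terrible plot"]
train_labels = [1, 0]

# Frozen SBERT embeddings as the only features for the classifier.
train_embeddings = model.encode(train_texts)
clf = LogisticRegression(max_iter=1000).fit(train_embeddings, train_labels)

test_embeddings = model.encode(["a wonderful film"])
print(clf.predict(test_embeddings))
```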

thisisclement commented 3 years ago

Hi @aliosia, any luck so far with SBERT features fine-tuned for classification? Thanks!

davidmosca commented 2 years ago

Hi @nreimers, how would you use SBERT for a multiclass classification task where the documents to be classified each contain many sentences (say, 30)? Is there an example of how this would be done?

nreimers commented 2 years ago

@davidmosca The CrossEncoder can be used for this. Have a look at the examples in this repo.
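
A condensed sketch of the kind of CrossEncoder classification setup shown in those examples (modelled on the NLI cross-encoder training script); the model name, number of labels, and data below are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

# num_labels > 1 turns the cross-encoder head into a multiclass classifier.
model = CrossEncoder("distilroberta-base", num_labels=3)

train_samples = [
    InputExample(texts=["A man is eating food.", "A man is eating a meal."], label=0),
    InputExample(texts=["A man is eating food.", "The man is driving a car."], label=2),
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)

model.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=100)
```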

davidmosca commented 2 years ago

Hi @nreimers I have found this example but it only works for pairs of sentences. Is it possible to modify it to classify a full set of sentences? Thanks.

nreimers commented 2 years ago

Just concatenate the sentences into a single text.
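
A sketch of that concatenation approach: join each document's sentences into one string and train on single texts. This assumes the CrossEncoder accepts InputExample objects with a single-element texts list; the model name, labels, and data are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

documents = [
    ["First sentence of document one.", "Second sentence.", "Third sentence."],
    ["First sentence of document two.", "Another sentence."],
]
labels = [0, 1]

model = CrossEncoder("distilroberta-base", num_labels=3)  # placeholder model and label count

# One training example per document: all sentences joined into a single text.
train_samples = [
    InputExample(texts=[" ".join(sentences)], label=label)
    for sentences, label in zip(documents, labels)
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=8)

model.fit(train_dataloader=train_dataloader, epochs=1)
```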

davidmosca commented 2 years ago

Hi @nreimers, is there a maximum number of words that I might exceed if I concatenate all sentences? If so, is it possible to change this parameter, or to go for an alternative solution (that preserves all sentences)?
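
For reference, the limit is counted in tokens (wordpieces) rather than words: most BERT-style models truncate anything beyond 512 tokens. A small sketch of how the cap can be inspected or adjusted, with placeholder model names:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.cross_encoder import CrossEncoder

bi_encoder = SentenceTransformer("bert-base-nli-mean-tokens")
print(bi_encoder.max_seq_length)   # many models default to 128 tokens
bi_encoder.max_seq_length = 512    # can be raised, but not past the model's position-embedding limit

# For a CrossEncoder the truncation length is set at construction time.
cross_encoder = CrossEncoder("distilroberta-base", num_labels=3, max_length=512)
```

If the concatenated documents are much longer than that, alternatives such as a long-input model or splitting each document into chunks and pooling the per-chunk predictions would preserve more of the text.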

hosjiu1702 commented 2 years ago

> Thanks a lot for your explanation @nreimers. I will surely test the other way more, but in my first try I got better results with SBERT features.
>
> Also, the idea of first training with Siamese networks (contrastive loss or triplet loss) in an unsupervised way and then fine-tuning with a logistic loss for classification is not new; I remember that for nearly two years (around 2015) the state-of-the-art face classification models used both loss functions together. Hence, I think starting from a pre-trained network and fine-tuning with a classification loss seems reasonable.

It seems an LM pretrained on NLI/paraphrase data gives better embeddings that can be used directly for downstream tasks.