UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

additional train #758

ReySadeghi opened this issue 3 years ago (status: Open)

ReySadeghi commented 3 years ago

Hi, is there any script in this repository for additional training of a pretrained BERT on unlabeled data?

nreimers commented 3 years ago

Hi, we will release a new training method for this later this month. Currently there are no examples, but you could use the examples from Hugging Face transformers for pre-training with a masked language model (MLM).

As mentioned, we will release a better method than MLM for pre-training sentence embedding models this month.
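For reference, here is a minimal sketch of continued MLM pre-training with the Hugging Face transformers Trainer. The base checkpoint ("bert-base-uncased") and the corpus file ("corpus.txt", one sentence per line) are placeholders, and the official transformers example scripts may differ in detail.

```python
# Sketch: continued MLM pre-training on an unlabeled corpus.
# Placeholders: "bert-base-uncased" (base checkpoint) and "corpus.txt" (one sentence per line).
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load and tokenize the raw text corpus
dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)

# The collator pads batches and applies dynamic token masking (15% by default)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mlm-output",
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
trainer.save_model("mlm-output")
```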

ReySadeghi commented 3 years ago

Hi, has it been released yet?

nreimers commented 3 years ago

Not yet, it is currently under our internal review.

meesamnaqvi1 commented 3 years ago

Hi, a little background: I am working on French BERT models. My data is somewhat unconventional French (industrial maintenance logs). I tried out-of-the-box models (CamemBERT and FlauBERT) and also tried max pooling, since each log contains multiple sentences, but got no decent results. One of the multilingual pre-trained models from sentence-transformers (distiluse-base-multilingual-cased) is giving decent results, but I want to fine-tune CamemBERT and FlauBERT for my task.

Questions:

nreimers commented 3 years ago

Yes, it should work for any language (even though we have only tested it on English).

Code and paper will be released this week (deadline is the weekend)

meesamnaqvi1 commented 3 years ago

And will the trained model also support multi-sentence and paragraph inputs?

nreimers commented 3 years ago

It can be trained for that

nreimers commented 3 years ago

Hi @ReySadeghi @meesamnaqvi1

The paper has been published; it presents a new unsupervised pre-training method: https://arxiv.org/abs/2104.06979

Usage with SBERT will be quite easy. We are currently putting the final polish on the code. I think it will be published on Monday with a new version of SBERT supporting TSDAE.

Stay tuned!

ReySadeghi commented 3 years ago

Hi, thanks for your update. I'm still waiting for the release.

meesamnaqvi1 commented 3 years ago

@nreimers Thanks for the update. I am excited to test it on French BERT models.

meesamnaqvi1 commented 3 years ago

@nreimers I need some guidance regarding my goal with this additional training.

Let me first give you a bit of background about my problem:

I am using French BERT models (CamemBERT and FlauBERT) to compute French paragraph similarity. The problem I am facing is that the embeddings I get from these BERT models for paragraphs, or even short sentences, cannot be efficiently distinguished using cosine similarity. I tried different techniques to resolve the issue, including:

One of the solutions I thought might fix the problem was additional training (as I have unlabeled text) and converting these French models to sentence-transformer models, since I am getting some decent results at the paragraph level using the multilingual sentence-transformer model (distiluse-base-multilingual-cased).

I really need to adapt these French models for paragraph embeddings. Pardon me if the question is basic; I am new to NLP and need suggestions on how to adapt a typical BERT word-embedding model for paragraph embeddings.

I tried additional training and then computed the cosine similarity, but the problem is still the same. It would be much appreciated if you could give me some suggestions or guidance to resolve this issue.

nreimers commented 3 years ago

Out of the box, BERT does not produce good sentence embeddings.

Have you tested our approach here: https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/tsdae
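For reference, a minimal sketch of what training with the linked TSDAE example looks like; "camembert-base" and the two sentences are placeholders for your own checkpoint and unlabeled corpus.

```python
# Sketch of TSDAE training, following the repository's unsupervised-learning example.
# Placeholders: "camembert-base" and the two example sentences.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

train_sentences = [
    "Première entrée du journal de maintenance ...",
    "Deuxième entrée du journal de maintenance ...",
]

# Build a SentenceTransformer from a plain BERT-style checkpoint with CLS pooling
word_embedding_model = models.Transformer("camembert-base")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# The dataset adds noise (token deletion) to each sentence; the loss trains the
# encoder to produce embeddings from which a decoder can reconstruct the original.
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, drop_last=True)
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path="camembert-base", tie_encoder_decoder=True
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)
model.save("tsdae-camembert")
```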

meesamnaqvi1 commented 3 years ago

Yes, I left the comment after trying this approach. The output is still similar; the cosine similarity does not differ much between the cases. I trained the model for just a few epochs. How many epochs do you recommend? I have around 1000 custom data examples.

nreimers commented 3 years ago

1k unlabeled sentences is not much.

I can highly recommend labeling sentence pairs and using them for training.
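For reference, a minimal sketch of training on labeled sentence pairs with CosineSimilarityLoss, as in the repository's semantic textual similarity example; the model name, the pairs, and the scores below are placeholders.

```python
# Sketch of supervised fine-tuning on labeled sentence pairs (scores in [0, 1]).
# Placeholders: the model name and the two hand-written example pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("distiluse-base-multilingual-cased")

train_examples = [
    InputExample(texts=["La pompe vibre anormalement", "Vibration détectée sur la pompe"], label=0.9),
    InputExample(texts=["Filtre à air remplacé", "Courroie du convoyeur désalignée"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# CosineSimilarityLoss trains the embeddings so that cosine similarity matches the labels
train_loss = losses.CosineSimilarityLoss(model)

num_epochs = 4
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=int(0.1 * len(train_dataloader) * num_epochs),
)
model.save("similarity-model")
```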

meesamnaqvi1 commented 3 years ago

If I understand you correctly, you are suggesting a semantic similarity approach like this example. There is a French semantic similarity dataset with around 1000 labeled examples. Do you think that will be enough if I use the training method from the above example (the semantic similarity example) with 1000 examples?

nreimers commented 3 years ago

You can try it and see if it works

meesamnaqvi1 commented 3 years ago

Great, thank you for the suggestions and the quick response. I will try that as well.

I also wanted to update you on some fresh results I got. I was running a few tests, and in one of them I trained the model using TSDAE for 150 epochs (on around 1000 examples); it now seems to give good similarity at the sentence level.

I think the lower number of epochs previously could be the reason behind the bad results.

Anyway, thank you again. Have a nice day!

ReySadeghi commented 3 years ago

Hi, thanks for your new method TSDAE. I'm trying it, but I have a question:

During training, is the vocab.txt file updated or not? I mean, when we train the model on unlabeled sentences, will the tokenizer add new tokens, or should we add them ourselves?

nreimers commented 3 years ago

No, the tokenizer is not updated.

You could extend the tokenizer with tokens that appear in your data, but it is unclear if and how that would affect performance.
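For reference, a minimal sketch of extending the tokenizer manually before further training; the checkpoint and the added tokens are placeholders.

```python
# Sketch: manually adding domain-specific tokens before further training.
# Placeholders: "bert-base-uncased" and the token list.
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer("bert-base-uncased")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Add tokens that are frequent in your corpus but missing from the vocabulary
new_tokens = ["domainterm1", "domainterm2"]
num_added = word_embedding_model.tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} tokens")

# Resize the embedding matrix so the new tokens get (randomly initialized) vectors;
# they only become useful after further training on text that contains them.
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
```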

meesamnaqvi1 commented 3 years ago

@nreimers I am trying to train the model using TSDAE on 100k sentences.

I am doing this on a server with 4 TITAN Xp GPUs, but since multi-GPU training is not supported yet, the training runs on a single GPU. When I try to fit the model, training starts, but after some iterations it stops due to a memory issue.

I tried different batch sizes and even tried reducing the data to 10k sentences. I am following the TSDAE script on the README page.

Can you suggest any solution for this kind of scenario?

nreimers commented 3 years ago

You can try to reduce the max_seq_length. This has the largest impact on memory usage.
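For reference, a minimal sketch of lowering the sequence length limit; the checkpoint name and the value 128 are placeholders.

```python
# Sketch: reducing max_seq_length to lower GPU memory usage during training.
# Placeholders: "camembert-base" and the value 128.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("camembert-base")
print(model.max_seq_length)  # current truncation limit for encoding/training

# Self-attention memory grows roughly quadratically with sequence length,
# so truncating inputs earlier is usually the quickest way to avoid OOM errors.
model.max_seq_length = 128
```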