ReySadeghi opened 3 years ago
Hi, we will release a new training method for this this month. Currently there are no examples. But you could use the examples from huggingface transformers for pre-training with Masked Language Model.
As mentioned, we will release this month a better method than MLM for pre-training sentence embedding models.
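In the meantime, MLM further pre-training can be sketched with the standard huggingface transformers building blocks (the `run_mlm.py` example script in the transformers repo wraps this into a full training loop). This is only a minimal illustration; the checkpoint name and example sentences are placeholders:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

# Placeholder checkpoint; any masked-LM model (e.g. a French BERT) works the same way
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

# Randomly masks 15% of tokens and sets the labels for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

sentences = ['The pump is leaking oil.', 'Replace the hydraulic filter.']
batch = collator([tokenizer(s) for s in sentences])

# One forward pass; .loss is the masked-LM loss minimised during pre-training
outputs = model(**batch)
```

In practice you would feed batches like this to a `Trainer` or your own training loop over your unlabeled corpus.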
hi, has it been released yet?
not yet, it is currently under our internal review
Hi, a little background: I am working on French BERT models. My data is somewhat unconventional French (industrial maintenance logs). I tried out-of-the-box models (CamemBERT and FlauBERT) and also tried max pooling, since there are multiple sentences in each log, but got no decent results. One of the multilingual pre-trained models from sentence-transformers (distiluse-base-multilingual-cased) gives decent results, but I want to fine-tune CamemBERT and FlauBERT for my task.
Questions:
Yes, it should work for any language (even though we just tested it for English).
Code and paper will be released this week (deadline is the weekend)
And will the trained model also support multi-sentence and paragraph inputs?
It can be trained for that
Hi @ReySadeghi @meesamnaqvi1
The paper was published, where we present a new unsupervised pre-training method: https://arxiv.org/abs/2104.06979
Usage with SBERT will be quite easy. We are currently making the last cosmetic improvements to the code. I think it will be published on Monday with a new version of SBERT supporting TSDAE.
Stay tuned!
Hi, thanks for your update. I'm still waiting for the release.
@nreimers Thanks for the update. I am excited to test it on French BERT models.
@nreimers I need some guidance regarding my intentions behind this additional training.
Let me give you a bit of background about my problem first:
I am using French BERT models (CamemBERT and FlauBERT) to calculate French paragraph similarity. The problem I am facing is that the embeddings I get from these BERT models for paragraphs, or even short sentences, cannot be efficiently distinguished using cosine similarity. I tried different techniques to resolve the issue, including:
One of the solutions I thought might fix the problem was additional training (as I have unlabeled text) and converting these French models into sentence-transformer models, since I am getting decent results at the paragraph level using the multilingual sentence-transformer model (distiluse-base-multilingual-cased).
I really need to adapt these French models for paragraph embeddings. Pardon me if the question is basic; I am new to NLP and need suggestions on how to adapt a typical BERT word-embedding model for paragraph embeddings.
I tried additional training and computed the cosine similarity, but the problem is still the same. It would be much appreciated if you could give me some suggestions or guidance to resolve this issue.
Out of the box, BERT does not produce good sentence embeddings.
Have you tested our approach here: https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/tsdae
Yes, I left the comment after trying this approach. The output is still similar; the cosine similarities do not differ much between cases. I trained the model for just a few epochs. How many epochs do you recommend? I have around 1000 custom data examples.
1k unlabeled sentences are not much.
I can highly recommend labeling sentence pairs and using them for training.
If I understand you correctly, you are suggesting to use a semantic similarity approach like this example. There is a French semantic similarity dataset with around 1000 labeled examples. Do you think that will be enough if I use the training method from the semantic similarity example above with those 1000 examples?
You can try it and see if it works
Great, thank you for the suggestions and the quick response. I will try that as well.
I also wanted to update you on some fresh results. I was running a few tests, and in one of them I trained the model using TSDAE for 150 epochs (on around 1000 examples), and it now seems to give good similarity at the sentence level.
I think the low number of epochs was the reason behind the earlier bad results.
Anyway, thank you again. Have a nice day!
Hi, thanks for your new method TSDAE. I'm trying it, but I have a question:
during training, is the vocab.txt file updated or not? I mean, when we train the model on unlabeled sentences, will the tokenizer add new tokens, or should we add new tokens ourselves?
No, the tokenizer is not updated.
You could extend the tokenizer with tokens that appear in your data. But it is unclear whether, and how, that would affect performance.
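Extending the tokenizer can be sketched with the huggingface transformers API; the base model and the two domain terms below are hypothetical placeholders:

```python
from transformers import AutoTokenizer, AutoModel

model_name = 'camembert-base'  # assumption: the poster's French base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Hypothetical domain terms assumed missing from the vocabulary
new_tokens = ['hydropompe', 'électrovanne']
num_added = tokenizer.add_tokens(new_tokens)

# The embedding matrix must be resized to match the enlarged vocabulary.
# The new rows are randomly initialised, so further pre-training
# (e.g. MLM or TSDAE) is needed before they carry useful signal.
model.resize_token_embeddings(len(tokenizer))
```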
@nreimers I am trying to train the model using TSDAE on 100k sentences.
I am doing this on a server with 4 TITAN Xp GPUs, but since multi-GPU training is not supported yet, training runs on a single GPU. When I try to fit the model, training starts, but after some iterations it stops due to a memory issue.
I tried different batch sizes and even reduced the data to 10k sentences. I am following the TSDAE script on the README page.
Can you suggest a solution for this kind of scenario?
You can try to reduce max_seq_length; this has the largest impact on memory usage.
Hi, is there any script in this repository for additional training of a pretrained BERT on unlabeled data?