Hi @chschoenenberger,
Yes, such training should take place before the fine-tuning with the labeled data.
My student Kexin will soon release code & paper that tested different unsupervised pre-training approaches and developed a state-of-the-art approach for the case where you only have sentences.
If you have long documents, this unsupervised approach works quite nicely: https://github.com/JohnGiorgi/DeCLUTR
The training corpus for the paraphrase model is not yet open sourced. It consists of various datasets that I merged.
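As a rough illustration, a continued-training run on plain domain sentences can follow the SimCSE-style recipe: pair each sentence with itself and rely on dropout noise plus in-batch negatives. This is only a sketch; the corpus file, the starting model and the hyper-parameters below are placeholders, not the settings used for the released models:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from an existing sentence-embedding model (placeholder name).
model = SentenceTransformer("xlm-r-distilroberta-base-paraphrase-v1")

# Plain, unlabeled domain sentences, one per line (placeholder file).
with open("domain_sentences.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

# Each sentence is paired with itself; dropout makes the two encodings differ,
# and the other sentences in the batch act as negatives.
train_examples = [InputExample(texts=[s, s]) for s in sentences]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)
model.save("output/domain-adapted-model")
```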
Thanks a lot @nreimers! I'm happy to share our learnings when we're done. Closing this for now.
> My student Kexin will soon release code & paper that tested different unsupervised pre-training approaches and developed a state-of-the-art approach for the case where you only have sentences.

Any update on this? Would love to know the best ways to train Sentence-BERT via unsupervised pre-training.
@silburt Have a look here: https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning
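One of the approaches in that folder is a TSDAE-style denoising auto-encoder, which also only needs raw sentences. A condensed sketch follows; the base model, corpus file and hyper-parameters are assumptions, not the exact values from the repository scripts:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, datasets

base_model = "bert-base-uncased"  # placeholder base transformer
word_embedding_model = models.Transformer(base_model)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Raw, unlabeled sentences, one per line (placeholder file).
with open("domain_sentences.txt", encoding="utf-8") as f:
    train_sentences = [line.strip() for line in f if line.strip()]

# The dataset yields (noisy sentence, original sentence) pairs; the loss trains an
# encoder-decoder to reconstruct the original from the embedding of the noisy input.
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path=base_model, tie_encoder_decoder=True
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)
model.save("output/tsdae-model")
```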
Hi, I started to upload the training script and datasets here: https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/paraphrases
Hello @nreimers,
For a similar domain-specific task to the one described by @chschoenenberger (not in the medical domain, but also German-language text only), I intend to perform further unsupervised training followed by fine-tuning on labelled data, using the model provided by @PhilipMay as the starting point. For the fine-tuning part, I intend to either use the parameters suggested by @PhilipMay or do a hyper-parameter search myself. However, do you have any suggestions for the hyper-parameters of the further training? Or could you describe the hyper-parameters you used during the unsupervised pre-training?
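For the search over the fine-tuning part, what I have in mind is roughly the following; the candidate values, file format and model path are placeholders on my side, not recommendations:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

def load_pairs(path):
    # Placeholder format: "sentence1 <tab> sentence2 <tab> score", scores in [0, 5].
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            s1, s2, score = line.rstrip("\n").split("\t")
            examples.append(InputExample(texts=[s1, s2], label=float(score) / 5.0))
    return examples

train_examples = load_pairs("sts_de_train.tsv")
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    load_pairs("sts_de_dev.tsv"), name="sts-de-dev"
)

# Small grid over learning rate and batch size; each run restarts from the
# further-trained (domain-adapted) checkpoint.
for lr in (1e-5, 2e-5, 3e-5):
    for batch_size in (16, 32):
        model = SentenceTransformer("output/domain-adapted-model")
        train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
        train_loss = losses.CosineSimilarityLoss(model)
        model.fit(
            train_objectives=[(train_dataloader, train_loss)],
            evaluator=evaluator,
            epochs=4,
            warmup_steps=100,
            optimizer_params={"lr": lr},
            output_path=f"output/sts-lr{lr}-bs{batch_size}",
        )
```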
Hi
We're using Sentence Transformers to build a prototype for a semantic search engine in the medical domain. We currently focus on German (it will be multi-lingual later) and used the cross-en-de-roberta-sentence-transformer from @PhilipMay for a first version.
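At query time the prototype essentially does the standard encode-and-retrieve step, roughly like the sketch below (placeholder corpus and query):

```python
from sentence_transformers import SentenceTransformer, util

# The model mentioned above, loaded via its Hugging Face identifier.
model = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")

# Placeholder document collection; in our case these are domain-specific German texts.
corpus = ["Erster Beispielsatz.", "Zweiter Beispielsatz.", "Noch ein Dokument."]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("Beispielanfrage", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])
```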
In order to improve performance, we were thinking about fine-tuning the model (or rather, performing continued pre-training) on a domain-specific corpus. We do not have any labeled data for an STS task and would therefore simply use an unsupervised approach. I'm assuming that this fine-tuning would need to take place before the fine-tuning for the STS task. Is that correct? If yes, we would need to re-run the fine-tuning of @nreimers which led to the xlm-r-distilroberta-base-paraphrase-v1 model, and also the fine-tuning of Philip as described in #509. @nreimers, are you planning on open-sourcing the dataset you used for the fine-tuning of the xlm-roberta-base model? (Or did you already, and I wasn't looking in the right place?)
Thanks a lot for your suggestions in advance 😊