UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
15.13k stars 2.46k forks

Fine-tune on Domain Corpus #747

Closed chschoenenberger closed 3 years ago

chschoenenberger commented 3 years ago

Hi

We're using Sentence Transformers to build a prototype for a semantic search engine in the medical domain. We currently focus on German (will be multi-lingual) and used the cross-en-de-roberta-sentence-transformer from @PhilipMay for a first version.

In order to improve performance, we were thinking about fine-tuning the model (or rather performing continual pre-training) on a domain-specific corpus. We do not have any labeled data for an STS task and would therefore simply use an unsupervised approach. I'm assuming that this fine-tuning would need to take place before the fine-tuning for the STS task. Is that correct? If yes, we would need to re-run the fine-tuning of @nreimers which led to the xlm-r-distilroberta-base-paraphrase-v1 model, and also the fine-tuning of Philip as described in #509. @nreimers, are you planning on open-sourcing the dataset you used for the fine-tuning of the xlm-roberta-base model? (Or did you already and I wasn't looking in the right place?)

Thanks a lot for your suggestions in advance 😊

nreimers commented 3 years ago

Hi @chschoenenberger, yes, such training should take place before the fine-tuning with the labeled data.
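A rough sketch of that order, assuming a Hugging Face masked-language-model step for the domain adaptation followed by a standard sentence-transformers fit on the labeled pairs; the model names, file paths, and hyper-parameters below are illustrative placeholders, not the recipe used for the released models:

```python
# Step 1: continue MLM pre-training on an unlabeled domain corpus (placeholder settings).
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

base_model = "xlm-roberta-base"  # placeholder starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
mlm_model = AutoModelForMaskedLM.from_pretrained(base_model)

train_dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                      file_path="domain_corpus.txt",  # hypothetical one-sentence-per-line file
                                      block_size=256)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(model=mlm_model,
                  args=TrainingArguments(output_dir="domain-adapted",
                                         num_train_epochs=1,
                                         per_device_train_batch_size=16),
                  data_collator=collator,
                  train_dataset=train_dataset)
trainer.train()
trainer.save_model("domain-adapted")
tokenizer.save_pretrained("domain-adapted")

# Step 2: wrap the adapted checkpoint as a SentenceTransformer and fine-tune on labeled STS pairs.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

word_emb = models.Transformer("domain-adapted", max_seq_length=256)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
st_model = SentenceTransformer(modules=[word_emb, pooling])

labeled_pairs = [InputExample(texts=["Satz A", "Satz B"], label=0.9)]  # toy labeled pair
train_dataloader = DataLoader(labeled_pairs, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(st_model)
st_model.fit(train_objectives=[(train_dataloader, train_loss)],
             epochs=1, warmup_steps=100)
st_model.save("domain-adapted-sts")
```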

My student Kexin will soon release code & paper that tested different unsupervised pre-training approaches and developed a state-of-the-art approach if you just have sentences.

If you have long documents, this unsupervised approach works quite nicely: https://github.com/JohnGiorgi/DeCLUTR

The training corpus for the paraphrase model is not yet open sourced. It consists of various datasets that I merged.

chschoenenberger commented 3 years ago

Thanks a lot @nreimers! I'm happy to share our learnings when we're done. Closing this for now.

silburt commented 3 years ago

> My student Kexin will soon release code & paper that tested different unsupervised pre-training approaches and developed a state-of-the-art approach if you just have sentences.

Any update on this? Would love to know of the best ways to train sentenceBERT via unsupervised pre-training.

nreimers commented 3 years ago

@silburt Have a look here: https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning
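One of the approaches documented there is TSDAE, which only needs raw, unlabeled sentences. A minimal sketch with the sentence-transformers API; the base model, sentences, and hyper-parameters below are illustrative placeholders, not the exact settings from the examples:

```python
# TSDAE-style unsupervised training sketch; names and hyper-parameters are placeholders.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, datasets, losses, models

# Unlabeled sentences from your own domain corpus (toy examples here)
train_sentences = ["Erster Beispielsatz aus dem Fachkorpus.",
                   "Zweiter Beispielsatz aus dem Fachkorpus."]

word_emb = models.Transformer("deepset/gbert-base")  # placeholder German base model
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_emb, pooling])

# The dataset wrapper adds noise (token deletion) to each sentence on the fly;
# the loss trains an encoder-decoder to reconstruct the original sentence.
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1,
          weight_decay=0,
          scheduler="constantlr",
          optimizer_params={"lr": 3e-5},
          show_progress_bar=True)
model.save("tsdae-domain-model")
```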

nreimers commented 3 years ago

Hi, I started to upload the training script and datasets here: https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/paraphrases
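For reference, the core idea in that paraphrase training setup is to train on paraphrase pairs with in-batch negatives. A rough sketch in that spirit; the pairs, base model, and hyper-parameters below are placeholders, not the original datasets or settings:

```python
# Paraphrase-style training sketch with in-batch negatives; all data and settings
# below are placeholders, not the original paraphrase datasets.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

word_emb = models.Transformer("xlm-roberta-base")  # placeholder base model
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

# Each InputExample holds a pair of paraphrases; within a batch, every other
# sentence serves as a negative (MultipleNegativesRankingLoss).
train_examples = [
    InputExample(texts=["How do I reset my password?",
                        "What is the way to change my password?"]),
    InputExample(texts=["The weather is nice today.",
                        "It is sunny outside today."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=1000)
model.save("paraphrase-model")
```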

ManishBhandariV commented 3 years ago

Hello @nreimers ,

For a similar domain-specific task (not from the medical domain, but also using German-language text only) as described by @chschoenenberger, I intend to perform further training followed by fine-tuning on labelled data. I also intend to use the model provided by @PhilipMay as the starting point. For the fine-tuning part, I intend to either use the parameters suggested by @PhilipMay or do a hyper-parameter search myself. However, do you have any suggestions for the hyper-parameters for the further training? Or can you describe the hyper-parameters you used during unsupervised pre-training?