UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Looking for more German paraphrase datasets #1097

Closed PhilipMay closed 3 years ago

PhilipMay commented 3 years ago

Hey @nreimers, I am looking for German paraphrase datasets. Are there any others besides the PAWS-X dataset? Many thanks, Philip

nreimers commented 3 years ago

Hi @PhilipMay, does it have to be paraphrases? Or can it be any suitable training data for learning embedding models?

In the second case, I only know of GermanDPR, but there may be some German summarization datasets. I also plan to crawl (headline, news summary) pairs from Spiegel and Zeit.de. Sadly, because of copyright issues these datasets cannot be shared; only the script that collects them can be. This type of data would still be quite valuable for training embedding models.

PhilipMay commented 3 years ago

Hey @nreimers, thanks! We plan to do text augmentation with them, so paraphrases would be best. Thanks, Philip
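
For the augmentation use case, only the positive pairs of a dataset like PAWS-X are useful. Here is a minimal sketch of that filtering step, assuming the data comes as a TSV file with the columns id, sentence1, sentence2 and label, where label 1 marks a paraphrase. The helper name `paraphrase_pairs` and the two sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Toy rows in the assumed PAWS-X TSV layout (id, sentence1, sentence2, label).
# The sentences are invented for illustration, not copied from the dataset.
SAMPLE_TSV = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tDer Hund jagt die Katze.\tDie Katze wird vom Hund gejagt.\t1\n"
    "2\tDer Hund jagt die Katze.\tDie Katze jagt den Hund.\t0\n"
)


def paraphrase_pairs(tsv_file):
    """Yield (sentence1, sentence2) for rows labeled as paraphrases (label == 1)."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["label"] == "1":
            yield row["sentence1"], row["sentence2"]


# Hypothetical usage: replace io.StringIO(...) with an opened TSV file.
pairs = list(paraphrase_pairs(io.StringIO(SAMPLE_TSV)))
for original, paraphrase in pairs:
    print(original, "->", paraphrase)
```

With a real file, `open(path, encoding="utf-8", newline="")` would take the place of the in-memory sample.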

tagging @sitongye

closing this again