UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

About paraphrase dataset #1483

Open Wang-Yufei opened 2 years ago

Wang-Yufei commented 2 years ago

Hi, I want to train a Hugging Face model with MultipleNegativesRankingLoss, but the datasets listed in the table (https://www.sbert.net/examples/training/paraphrases/README.html#datasets) cannot be downloaded. Could you give an example program that shows how to convert the original data files into the format required by the training script (https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/paraphrases/training.py)? Thanks a lot!

nreimers commented 2 years ago

Thanks, I need to update the links.

The datasets can be found here: https://huggingface.co/datasets/sentence-transformers/embedding-training-data

They are now in JSONL format, not .tsv, so you might need to update that script to read JSONL.
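
A minimal sketch of that conversion, using only the Python standard library. It assumes each JSONL line is a JSON list of two strings, such as ["text a", "text b"]; files that use another layout (for example, objects with named fields) would need a different parsing step. The function names `open_text`, `read_pairs` and `jsonl_to_tsv` are made up for this illustration and are not part of sentence-transformers.

```python
import gzip
import json


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed text file, chosen by file extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode, encoding="utf8")
    return open(path, mode, encoding="utf8")


def read_pairs(jsonl_path):
    """Yield (text_a, text_b) tuples from a JSONL file of text pairs."""
    with open_text(jsonl_path) as f_in:
        for line in f_in:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Assumption: a record is a JSON list whose first two items are strings.
            # Records in any other shape are skipped.
            if (
                isinstance(record, list)
                and len(record) >= 2
                and all(isinstance(text, str) for text in record[:2])
            ):
                yield record[0], record[1]


def jsonl_to_tsv(jsonl_path, tsv_path):
    """Write each pair as one tab-separated line; return the number of rows written."""
    count = 0
    with open_text(tsv_path, "wt") as f_out:
        for text_a, text_b in read_pairs(jsonl_path):
            # Collapse all whitespace runs (including tabs and newlines) to single
            # spaces so that each pair stays on one line with exactly one tab.
            cleaned = [" ".join(text.split()) for text in (text_a, text_b)]
            f_out.write("\t".join(cleaned) + "\n")
            count += 1
    return count
```

Calling `jsonl_to_tsv("pairs.jsonl.gz", "pairs.tsv.gz")` (hypothetical file names) would produce a two-column TSV. If you would rather skip the TSV step, the tuples from `read_pairs` could be wrapped directly as `InputExample(texts=[text_a, text_b])` for MultipleNegativesRankingLoss.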