IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0
260 stars 60 forks source link

Create dataset loader for TaPaCo #371

Open SamuelCahyawijaya opened 10 months ago

SamuelCahyawijaya commented 10 months ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?tapaco

Dataset tapaco
Description A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000 sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
License CC-BY 2.0