IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0
262 stars 62 forks

Create dataset loader for KoPI-NLLB #252

Closed. SamuelCahyawijaya closed this issue 2 years ago

SamuelCahyawijaya commented 2 years ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?kopi_nllb

Dataset: kopi_nllb
Description: KoPI (Korpus Perayapan Indonesia)-NLLB contains only the Indonesian-family languages (Acehnese, Balinese, Banjar, Indonesian, Javanese, Minangkabau, Sundanese) extracted from the NLLB dataset. Each language subset is also deduplicated, using exact-hash (MD5) deduplication and MinHash LSH near-duplicate detection.
License: ODC-BY
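
The description names two deduplication steps but does not say how they were implemented. As a rough illustration only, the sketch below shows one way exact MD5 deduplication and MinHash LSH near-duplicate filtering could look using just the Python standard library. The function names (`exact_dedup`, `near_dedup`, and the helpers) and all parameters (3-word shingles, 64 hash functions, 16 bands) are assumptions made for this example, not the settings used to build KoPI-NLLB.

```python
import hashlib


def md5_key(text):
    # Exact-duplicate key: MD5 hex digest of the whitespace-stripped text.
    return hashlib.md5(text.strip().encode("utf-8")).hexdigest()


def exact_dedup(texts):
    # Keep the first occurrence of each distinct MD5 key, preserving order.
    seen, kept = set(), []
    for text in texts:
        key = md5_key(text)
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


def shingles(text, n=3):
    # Set of lowercase word n-grams; short texts become a single shingle.
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def hash64(seed, shingle):
    # Deterministic 64-bit hash of a shingle, salted with the seed
    # (assumption: salted MD5 stands in for a family of hash functions).
    digest = hashlib.md5(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text, num_perm=64):
    # One minimum per seeded hash function over the text's shingles.
    sh = shingles(text)
    return [min(hash64(seed, s) for s in sh) for seed in range(num_perm)]


def near_dedup(texts, num_perm=64, bands=16):
    # LSH banding: a text is dropped when any band of its signature
    # matches the same band of a text that was already kept.
    rows = num_perm // bands
    buckets = set()
    kept = []
    for text in texts:
        sig = minhash_signature(text, num_perm)
        keys = [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]
        if any(key in buckets for key in keys):
            continue
        buckets.update(keys)
        kept.append(text)
    return kept
```

A typical use would chain the two steps per language subset, for example `near_dedup(exact_dedup(lines))`. For corpora of realistic size, a dedicated MinHash LSH library is likely a better fit than this pure-Python loop.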

SamuelCahyawijaya commented 2 years ago

Duplicate of https://github.com/IndoNLP/nusa-crowd/issues/245