IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0
261 stars, 61 forks

Create dataset loader for Korpus Nusantara #224

Closed: SamuelCahyawijaya closed this issue 1 year ago

SamuelCahyawijaya commented 2 years ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?korpus_nusantara

Dataset: korpus_nusantara
Description: The dataset combines multiple machine translation works by the author, Herry Sujaini, covering translation from Indonesian into 25 local dialects of Indonesia. Because not all of these dialects have an ISO 639-3 code, we agreed with Pak Herry to group the data by closest language family, i.e.: Javanese, Dayak, Buginese, Sundanese, Madurese, Banjar, Batak Toba, Khek, Malay, Minangkabau, and Tiociu.
License: Unknown

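
As a rough illustration of the grouping described above, the eleven language groups could be enumerated and turned into one subset identifier per Indonesian-to-group pair. This is only a sketch: the `subset_name` helper and the `korpus_nusantara_ind_<group>` naming pattern are assumptions made for illustration, not the naming scheme used by the actual loader. The only external fact relied on is that `ind` is the ISO 639-3 code for Indonesian.

```python
# Language groups exactly as listed in the dataset description above.
LANGUAGE_GROUPS = [
    "Javanese", "Dayak", "Buginese", "Sundanese", "Madurese", "Banjar",
    "Batak Toba", "Khek", "Malay", "Minangkabau", "Tiociu",
]


def subset_name(group: str, dataset: str = "korpus_nusantara") -> str:
    """Build a hypothetical subset identifier for the Indonesian -> group pair.

    The naming pattern is an assumption for illustration only.
    """
    # Lowercase the group name and replace spaces so it is identifier-safe.
    slug = group.lower().replace(" ", "_")
    # "ind" is the ISO 639-3 code for Indonesian (the source language).
    return f"{dataset}_ind_{slug}"


# One hypothetical subset per language group.
SUBSETS = [subset_name(group) for group in LANGUAGE_GROUPS]

if __name__ == "__main__":
    for name in SUBSETS:
        print(name)
```

Using the group names rather than ISO 639-3 codes for the target side mirrors the point made in the description: several of the dialects have no code of their own, so a plain name-based slug avoids implying a code that does not exist.
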
yana-xuyan commented 2 years ago

self-assign