IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0
260 stars, 60 forks

Create dataset loader for NusaParagraph #347

Closed by SamuelCahyawijaya 1 year ago

SamuelCahyawijaya commented 1 year ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?nusa_alinea

Dataset: nusa_alinea
Description: NusaAlinea is a human-written, paragraph-level dataset that covers 10 local languages in Indonesia. It consists of around 50,000 paragraphs, each with around 100 tokens, for a total of 6M tokens. The dataset is labelled with topic, emotion, and paragraph type.
License: CC-BY-NC-SA 4.0

haryoa commented 1 year ago

self-assign

SamuelCahyawijaya commented 1 year ago

@haryoa, sorry that we need to close this PR, as it turns out we need to restructure the entry in the NusaCatalogue for this dataset 🙏🏻