SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Indonesian News Dataset #422

Closed: SamuelCahyawijaya closed this issue 6 months ago

SamuelCahyawijaya commented 8 months ago

Dataloader name: indonesian_news_dataset/indonesian_news_dataset.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?indonesian_news_dataset

Dataset: indonesian_news_dataset
Description: An imbalanced dataset for classifying Indonesian news articles. The dataset contains 5 class labels: bola, news, bisnis, tekno, and otomotif. It comprises around 6k train and 2.5k test examples, with the more prevalent classes (bola and news) having roughly 10x as many train and test examples as the least prevalent class (otomotif).
Subsets: -
Languages: ind
Tasks: Text Classification, Topic Classification
License: Unknown (unknown)
Homepage: https://github.com/andreaschandra/indonesian-news
HF URL: -
Paper URL: -
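
For whoever picks this up, here is a minimal sketch of how the label set and the class imbalance described above could be checked once the data is loaded. It is not the SEACrowd dataloader itself. The helper names (`label_distribution`, `imbalance_ratio`) are hypothetical, and the toy input is made up for illustration; only the five label names come from the description.

```python
from collections import Counter

# The five class labels named in the dataset description.
LABELS = ["bola", "news", "bisnis", "tekno", "otomotif"]


def label_distribution(labels):
    """Return the share of examples per class, keyed in LABELS order."""
    if not labels:
        raise ValueError("labels must not be empty")
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    return {name: counts[name] / total for name in LABELS}


def imbalance_ratio(labels):
    """Ratio of the most frequent to the least frequent class present."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy input, invented for illustration only; not taken from the dataset.
toy = ["bola"] * 10 + ["news"] * 10 + ["bisnis"] * 3 + ["tekno"] * 2 + ["otomotif"]

if __name__ == "__main__":
    print(label_distribution(toy))
    print(imbalance_ratio(toy))
```

Running the same two helpers on the real train and test labels would be one way to confirm the roughly 10x gap between the most and least prevalent classes before wiring up the loader.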
joanitolopo commented 8 months ago

self-assign