SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for UIT-ViON #273

Closed. SamuelCahyawijaya closed this issue 9 months ago

SamuelCahyawijaya commented 10 months ago

Dataloader name: uit_vion/uit_vion.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?uit_vion

Dataset: uit_vion
Description: UIT-ViON (Vietnamese Online Newspaper) is a dataset collected from well-known Vietnamese online newspapers. It is an open-domain, large-scale, high-quality dataset of 260,000 texts annotated with 13 categories for evaluating Vietnamese short text classification. The dataset is split into training, validation, and test sets containing 208,000, 26,000, and 26,000 texts, respectively.
Subsets: -
Languages: vie
Tasks: Text Classification
License: Unknown (unknown)
Homepage: https://github.com/kh4nh12/UIT-ViON-Dataset
HF URL: -
Paper URL: https://ebooks.iospress.nl/DOI/10.3233/FAIA210036

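As a quick sanity check on the numbers in the description, the following minimal sketch confirms that the stated split sizes add up to the stated total and shows each split's share. The names `SPLIT_SIZES`, `STATED_TOTAL`, and `split_fractions` are hypothetical and are not part of the SEACrowd loader or the dataset's files; a real dataloader could run a similar check against the example counts it actually reads.

```python
# Split sizes and total exactly as stated in the dataset description.
# These are the reported figures, not counts read from the data files.
SPLIT_SIZES = {"train": 208_000, "validation": 26_000, "test": 26_000}
STATED_TOTAL = 260_000


def split_fractions(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


# The three splits should account for every example in the dataset.
total = sum(SPLIT_SIZES.values())
assert total == STATED_TOTAL, f"splits sum to {total}, expected {STATED_TOTAL}"

fractions = split_fractions(SPLIT_SIZES)
for name, frac in fractions.items():
    print(f"{name}: {SPLIT_SIZES[name]} examples ({frac:.0%})")
```

With the figures above, the splits work out to an 80/10/10 division of the 260,000 texts.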
Alex-HaochenLi commented 10 months ago

self-assign