SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for MDIA #527

Closed SamuelCahyawijaya closed 5 months ago

SamuelCahyawijaya commented 7 months ago

Dataloader name: mdia/mdia.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?mdia

Dataset: mdia
Description: This is a multilingual benchmark for dialogue generation containing real-life Reddit conversations (parent and response comment pairs) in 46 languages, including Indonesian, Tagalog and Vietnamese. English translations are also provided for comments.
Subsets: ind_dialogue, ind_eng, tgl_dialogue, tgl_eng, vie_dialogue, vie_eng
Languages: ind, tgl, vie
Tasks: Dialogue System
License: Creative Commons Attribution 4.0 (cc-by-4.0)
Homepage: https://github.com/DoctorDream/mDIA
HF URL: -
Paper URL: https://arxiv.org/pdf/2208.13078.pdf
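
The six subsets listed above appear to be the product of the three language codes and two subset kinds. As a minimal sketch of how a loader might enumerate and validate them (the helper names `subset_ids` and `split_subset_id` are hypothetical and not part of the actual mdia/mdia.py, and reading "eng" as "with English translations" is an assumption based on the description):

```python
from itertools import product

# Language codes and subset kinds copied from the issue's metadata.
LANGUAGES = ["ind", "tgl", "vie"]
# Assumption: "dialogue" holds the comment pairs, "eng" the English translations.
KINDS = ["dialogue", "eng"]


def subset_ids(languages=LANGUAGES, kinds=KINDS):
    """Return subset ids such as 'ind_dialogue', ordered by language, then kind."""
    return [f"{lang}_{kind}" for lang, kind in product(languages, kinds)]


def split_subset_id(subset_id):
    """Split 'ind_dialogue' into ('ind', 'dialogue'); reject unknown ids."""
    lang, _, kind = subset_id.partition("_")
    if lang not in LANGUAGES or kind not in KINDS:
        raise ValueError(f"Unknown subset id: {subset_id!r}")
    return lang, kind


if __name__ == "__main__":
    for sid in subset_ids():
        print(sid, split_subset_id(sid))
```

Generating the ids from the two lists, rather than hard-coding six strings, keeps the subset list and the language list from drifting apart if more languages are added later.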

akhdanfadh commented 6 months ago

self-assign