SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for QED #512

Closed by SamuelCahyawijaya 3 months ago

SamuelCahyawijaya commented 5 months ago

Dataloader name: qed/qed.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?qed

Dataset: qed
Description: QED, the QCRI Educational Domain Corpus (formerly the QCRI AMARA Corpus), is an open multilingual collection of subtitles for educational videos and lectures, collaboratively transcribed and translated on the AMARA web-based platform. It is developed by the Arabic Language Technologies Group at the Qatar Computing Research Institute. Along with English, it covers multiple SEA languages, such as vi (Vietnamese), my (Burmese), jv (Javanese), id (Indonesian), th (Thai), tl (Tagalog), and ms (Malay).
Subsets: -
Languages: eng, vie, tha, mya, jav, ind, tgl, zlm
Tasks: Machine Translation, Language Modeling
License: Other (other)
Homepage: https://opus.nlpl.eu/QED/corpus/version/QED
HF URL: -
Paper URL: https://aclanthology.org/L14-1675/
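
The description uses two-letter codes while the Languages field uses three-letter ones, so a loader would likely need a mapping between them. The following is a minimal sketch of that mapping. The names `QED_LANG_CODES` and `sea_pairs` are hypothetical and not part of seacrowd-datahub, and pairing every language with English is an assumption about how subsets might be organized, since the Subsets field is empty.

```python
# Hypothetical mapping from the two-letter codes in the description
# to the three-letter codes in the Languages field of this issue.
QED_LANG_CODES = {
    "en": "eng",
    "vi": "vie",
    "th": "tha",
    "my": "mya",
    "jv": "jav",
    "id": "ind",
    "tl": "tgl",
    "ms": "zlm",  # the issue lists zlm, not the macrolanguage code msa
}


def sea_pairs(pivot="en"):
    """Return (pivot, other) three-letter code pairs for every non-pivot language.

    Assumption: translation subsets are built against a single pivot language.
    """
    pivot3 = QED_LANG_CODES[pivot]
    return [
        (pivot3, code)
        for key, code in QED_LANG_CODES.items()
        if key != pivot
    ]


if __name__ == "__main__":
    for src, tgt in sea_pairs():
        print(f"{src}-{tgt}")
```

With English as the pivot this yields seven pairs, one per SEA language listed above.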
patrickamadeus commented 5 months ago

self-assign