SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Belebele #7

Closed: SamuelCahyawijaya closed this 11 months ago

SamuelCahyawijaya commented 1 year ago

Dataloader name: belebele/belebele.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?belebele

Dataset: belebele
Description: Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks. While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.
Subsets: ceb_Latn, ilo_Latn, ind_Latn, jav_Latn, kac_Latn, khm_Khmr, lao_Laoo, mya_Mymr, shn_Mymr, sun_Latn, tgl_Latn, tha_Thai, vie_Latn, war_Latn, zsm_Latn
Languages: ceb, ilo, ind, jav, kac, khm, lao, mya, shn, sun, tgl, vie, war, zsm
Tasks: Question Answering
License: Creative Commons Attribution Non Commercial Share Alike 4.0 (cc-by-nc-sa-4.0)
Homepage: https://github.com/facebookresearch/belebele
HF URL: https://huggingface.co/datasets/facebook/belebele
Paper URL: https://arxiv.org/pdf/2308.16884v1.pdf
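
For anyone picking this up, here is a rough sketch of the per-row conversion such a loader would probably need: each Belebele question has four answer options and one passage, so a row can be flattened into a question/context/choices/answer record. The input field names (`flores_passage`, `question`, `mc_answer1` to `mc_answer4`, `correct_answer_num`) are assumed from the Hugging Face dataset card, and the output keys and the helper names `to_qa_example` and `split_subset` are hypothetical, not the project's actual schema. Please check both against the real data before relying on this.

```python
# Subsets requested in this issue (copied from the list above).
SEA_SUBSETS = [
    "ceb_Latn", "ilo_Latn", "ind_Latn", "jav_Latn", "kac_Latn",
    "khm_Khmr", "lao_Laoo", "mya_Mymr", "shn_Mymr", "sun_Latn",
    "tgl_Latn", "tha_Thai", "vie_Latn", "war_Latn", "zsm_Latn",
]


def split_subset(subset: str) -> tuple:
    """Split a subset name such as 'khm_Khmr' into (language, script)."""
    lang, script = subset.split("_")
    return lang, script


def to_qa_example(row: dict, idx: int) -> dict:
    """Flatten one Belebele-style row into a multiple-choice QA record.

    ASSUMPTION: input keys follow the Hugging Face dataset card;
    output keys are illustrative only.
    """
    # Collect the four answer options in order.
    choices = [row[f"mc_answer{i}"] for i in range(1, 5)]
    # correct_answer_num is assumed to be 1-based (as a string or an int).
    answer_index = int(row["correct_answer_num"]) - 1
    return {
        "id": str(idx),
        "question": row["question"],
        "context": row["flores_passage"],
        "choices": choices,
        "answer": [choices[answer_index]],
    }


# Made-up placeholder row, only to show the shape of the data.
demo_row = {
    "flores_passage": "A short passage.",
    "question": "What is this?",
    "mc_answer1": "A",
    "mc_answer2": "B",
    "mc_answer3": "C",
    "mc_answer4": "D",
    "correct_answer_num": "2",
}
demo_example = to_qa_example(demo_row, 0)
```

With the placeholder row, `demo_example["answer"]` holds the second option, since the answer number is treated as 1-based.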

gagan3012 commented 1 year ago

Can I take this?

holylovenia commented 1 year ago

@gagan3012 Yes! You can assign yourself in the project and follow the guide here. 😄

mnjkhtri commented 1 year ago

self-assign