Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks. While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.
For more details, please refer to our paper, The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants.
Belebele can be downloaded here, or programmatically with the following bash command:
```bash
wget --trust-server-names https://dl.fbaipublicfiles.com/belebele/Belebele.zip
```
The dataset can additionally be used via the HuggingFace repo, such as with the `datasets` library.
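For example, a minimal loading sketch (the `facebook/belebele` repo name, the `test` split, and the `eng_Latn` configuration are assumptions here):

```python
from datasets import load_dataset

# Each language variant is a separate configuration named by its
# FLORES-200 code (see the Languages table below).
belebele = load_dataset("facebook/belebele", "eng_Latn")

row = belebele["test"][0]
# correct_answer_num is one-indexed, so it maps directly to mc_answer1..mc_answer4.
answer = row[f"mc_answer{row['correct_answer_num']}"]
print(row["question"], "->", answer)
```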
Each row of the dataset corresponds to a single question:

- `link` and `split` uniquely identify a passage
- (`link` and `split`) and `question_number` (either 1 or 2) uniquely identify a question
- The `dialect` column contains the FLORES-200 code for the language variant (see Languages below)
- `correct_answer_num` is one-indexed (e.g. a value of `2` means `mc_answer2` is correct)

Thanks to the parallel nature of the dataset and the simplicity of the task, there are many possible settings in which we can evaluate language models. In all evaluation settings, the metric of interest is simple accuracy (# correct / total). Several of these settings were implemented for a range of models and are discussed in the paper.
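Scoring is therefore straightforward; a minimal sketch, assuming model predictions are collected as one-indexed answer numbers parallel to the evaluated rows:

```python
def accuracy(predictions, rows):
    """Simple accuracy (# correct / total) over Belebele rows.

    `predictions` holds one-indexed answer numbers (1-4), one per row.
    """
    correct = sum(
        pred == int(row["correct_answer_num"]) for pred, row in zip(predictions, rows)
    )
    return correct / len(rows)
```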
Evaluating models on Belebele in English can be done via finetuning, few-shot, or zero-shot. For other target languages, we propose the non-exhaustive list of evaluation settings below. Settings that are compatible with evaluating non-English models (monolingual or cross-lingual) are denoted with `^`.
- Zero-shot with instructions: see the `sample_zero_shot_instructions.md` file for specific details on how we evaluated instructed models for the paper.
- Few-shot in-context learning: each sample is presented in the template `P: <passage> \n Q: <question> \n A: <mc answer 1> \n B: <mc answer 2> \n C: <mc answer 3> \n D: <mc answer 4> \n Answer: <Correct answer letter>`. We perform prediction by picking the answer within `[A, B, C, D]` that has the highest probability relative to the others, as sketched below.
- The fully parallel data also permits cross-lingual settings: passage in language `x`, question in language `y`, and answers in language `z`.
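A minimal sketch of this letter-scoring scheme with a HuggingFace causal LM (the `gpt2` checkpoint is a placeholder, and field names such as `flores_passage` are assumed to match the dataset columns):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def predict_letter(row):
    # Format the sample with the template above, leaving the answer to be predicted.
    prompt = (
        f"P: {row['flores_passage']}\n"
        f"Q: {row['question']}\n"
        f"A: {row['mc_answer1']}\nB: {row['mc_answer2']}\n"
        f"C: {row['mc_answer3']}\nD: {row['mc_answer4']}\n"
        f"Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Compare only the four answer letters and pick the most probable one.
    letter_ids = [tokenizer.encode(f" {letter}")[0] for letter in "ABCD"]
    return "ABCD"[int(next_token_logits[letter_ids].argmax())]
```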
In addition, there are 83 more languages in FLORES-200 for which questions were not translated for Belebele. Since the passages exist in those target languages, machine-translating the questions and answers may enable decent evaluation of machine reading comprehension in those languages.
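For example, one could translate the English questions and answers with an NLLB-200 checkpoint, which covers all FLORES-200 languages (a sketch, not the paper's procedure; the checkpoint and the `ltz_Latn` target are illustrative):

```python
from transformers import pipeline

# Translate English questions/answers into a FLORES-200 language
# that has passages but no translated Belebele questions.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="ltz_Latn",  # Luxembourgish: in FLORES-200, not in Belebele
)

def translate_qa(row):
    fields = ["question", "mc_answer1", "mc_answer2", "mc_answer3", "mc_answer4"]
    return {f: translator(row[f])[0]["translation_text"] for f in fields}
```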
As discussed in the paper, we also provide an assembled training set. The Belebele dataset is intended to be used only as a test set, and not for training or validation. Therefore, for models that require additional task-specific training, we propose using an assembled training set consisting of samples from pre-existing multiple-choice QA datasets in English. We considered diverse datasets and determined the most compatible to be RACE, SciQ, MultiRC, MCTest, MCScript2.0, and ReClor.
For each of the six datasets, we unpack and restructure the passages and questions from their respective formats. We then filter out less suitable samples (e.g. questions with multiple correct answers). In the end, the dataset comprises 67.5k training samples and 3.7k development samples, more than half of which are from RACE. We provide a script (`assemble_training_set.py`) to reconstruct this dataset for anyone wishing to perform task finetuning.
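As a purely illustrative sketch of the restructuring and filtering step (the helper and the output field layout are hypothetical; the actual logic lives in `assemble_training_set.py`):

```python
def restructure(passage, question, options, correct_indices):
    """Normalize one source-dataset sample into Belebele-style columns.

    Returns None for less suitable samples, e.g. questions with multiple
    correct answers or without exactly four answer options.
    """
    if len(correct_indices) != 1 or len(options) != 4:
        return None
    sample = {"passage": passage, "question": question}
    for i, option in enumerate(options):
        sample[f"mc_answer{i + 1}"] = option
    sample["correct_answer_num"] = correct_indices[0] + 1  # one-indexed
    return sample
```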
Since the training set is assembled from samples of other datasets, it is governed by a different license. We do not claim any of those datasets or their underlying work as our own. See the Licenses section.
FLORES-200 Code | English Name | Script | Family |
---|---|---|---|
acm_Arab | Mesopotamian Arabic | Arab | Afro-Asiatic |
afr_Latn | Afrikaans | Latn | Germanic |
als_Latn | Tosk Albanian | Latn | Paleo-Balkanic |
amh_Ethi | Amharic | Ethi | Afro-Asiatic |
apc_Arab | North Levantine Arabic | Arab | Afro-Asiatic |
arb_Arab | Modern Standard Arabic | Arab | Afro-Asiatic |
arb_Latn | Modern Standard Arabic (Romanized) | Latn | Afro-Asiatic |
ars_Arab | Najdi Arabic | Arab | Afro-Asiatic |
ary_Arab | Moroccan Arabic | Arab | Afro-Asiatic |
arz_Arab | Egyptian Arabic | Arab | Afro-Asiatic |
asm_Beng | Assamese | Beng | Indo-Aryan |
azj_Latn | North Azerbaijani | Latn | Turkic |
bam_Latn | Bambara | Latn | Mande |
ben_Beng | Bengali | Beng | Indo-Aryan |
ben_Latn^ | Bengali (Romanized) | Latn | Indo-Aryan |
bod_Tibt | Standard Tibetan | Tibt | Sino-Tibetan |
bul_Cyrl | Bulgarian | Cyrl | Balto-Slavic |
cat_Latn | Catalan | Latn | Romance |
ceb_Latn | Cebuano | Latn | Austronesian |
ces_Latn | Czech | Latn | Balto-Slavic |
ckb_Arab | Central Kurdish | Arab | Iranian |
dan_Latn | Danish | Latn | Germanic |
deu_Latn | German | Latn | Germanic |
ell_Grek | Greek | Grek | Hellenic |
eng_Latn | English | Latn | Germanic |
est_Latn | Estonian | Latn | Uralic |
eus_Latn | Basque | Latn | Basque |
fin_Latn | Finnish | Latn | Uralic |
fra_Latn | French | Latn | Romance |
fuv_Latn | Nigerian Fulfulde | Latn | Atlantic-Congo |
gaz_Latn | West Central Oromo | Latn | Afro-Asiatic |
grn_Latn | Guarani | Latn | Tupian |
guj_Gujr | Gujarati | Gujr | Indo-Aryan |
hat_Latn | Haitian Creole | Latn | French Creole |
hau_Latn | Hausa | Latn | Afro-Asiatic |
heb_Hebr | Hebrew | Hebr | Afro-Asiatic |
hin_Deva | Hindi | Deva | Indo-Aryan |
hin_Latn^ | Hindi (Romanized) | Latn | Indo-Aryan |
hrv_Latn | Croatian | Latn | Balto-Slavic |
hun_Latn | Hungarian | Latn | Uralic |
hye_Armn | Armenian | Armn | Armenian |
ibo_Latn | Igbo | Latn | Atlantic-Congo |
ilo_Latn | Ilocano | Latn | Austronesian |
ind_Latn | Indonesian | Latn | Austronesian |
isl_Latn | Icelandic | Latn | Germanic |
ita_Latn | Italian | Latn | Romance |
jav_Latn | Javanese | Latn | Austronesian |
jpn_Jpan | Japanese | Jpan | Japonic |
kac_Latn | Jingpho | Latn | Sino-Tibetan |
kan_Knda | Kannada | Knda | Dravidian |
kat_Geor | Georgian | Geor | Kartvelian |
kaz_Cyrl | Kazakh | Cyrl | Turkic |
kea_Latn | Kabuverdianu | Latn | Portuguese Creole |
khk_Cyrl | Halh Mongolian | Cyrl | Mongolic |
khm_Khmr | Khmer | Khmr | Austroasiatic |
kin_Latn | Kinyarwanda | Latn | Atlantic-Congo |
kir_Cyrl | Kyrgyz | Cyrl | Turkic |
kor_Hang | Korean | Hang | Koreanic |
lao_Laoo | Lao | Laoo | Kra-Dai |
lin_Latn | Lingala | Latn | Atlantic-Congo |
lit_Latn | Lithuanian | Latn | Balto-Slavic |
lug_Latn | Ganda | Latn | Atlantic-Congo |
luo_Latn | Luo | Latn | Nilo-Saharan |
lvs_Latn | Standard Latvian | Latn | Balto-Slavic |
mal_Mlym | Malayalam | Mlym | Dravidian |
mar_Deva | Marathi | Deva | Indo-Aryan |
mkd_Cyrl | Macedonian | Cyrl | Balto-Slavic |
mlt_Latn | Maltese | Latn | Afro-Asiatic |
mri_Latn | Maori | Latn | Austronesian |
mya_Mymr | Burmese | Mymr | Sino-Tibetan |
nld_Latn | Dutch | Latn | Germanic |
nob_Latn | Norwegian Bokmål | Latn | Germanic |
npi_Deva | Nepali | Deva | Indo-Aryan |
npi_Latn^ | Nepali (Romanized) | Latn | Indo-Aryan |
nso_Latn | Northern Sotho | Latn | Atlantic-Congo |
nya_Latn | Nyanja | Latn | Atlantic-Congo |
ory_Orya | Odia | Orya | Indo-Aryan |
pan_Guru | Eastern Panjabi | Guru | Indo-Aryan |
pbt_Arab | Southern Pashto | Arab | Iranian |
pes_Arab | Western Persian | Arab | Iranian |
plt_Latn | Plateau Malagasy | Latn | Austronesian |
pol_Latn | Polish | Latn | Balto-Slavic |
por_Latn | Portuguese | Latn | Romance |
ron_Latn | Romanian | Latn | Romance |
rus_Cyrl | Russian | Cyrl | Balto-Slavic |
shn_Mymr | Shan | Mymr | Kra-Dai |
sin_Latn^ | Sinhala (Romanized) | Latn | Indo-Aryan |
sin_Sinh | Sinhala | Sinh | Indo-Aryan |
slk_Latn | Slovak | Latn | Balto-Slavic |
slv_Latn | Slovenian | Latn | Balto-Slavic |
sna_Latn | Shona | Latn | Atlantic-Congo |
snd_Arab | Sindhi | Arab | Indo-Aryan |
som_Latn | Somali | Latn | Afro-Asiatic |
sot_Latn | Southern Sotho | Latn | Atlantic-Congo |
spa_Latn | Spanish | Latn | Romance |
srp_Cyrl | Serbian | Cyrl | Balto-Slavic |
ssw_Latn | Swati | Latn | Atlantic-Congo |
sun_Latn | Sundanese | Latn | Austronesian |
swe_Latn | Swedish | Latn | Germanic |
swh_Latn | Swahili | Latn | Atlantic-Congo |
tam_Taml | Tamil | Taml | Dravidian |
tel_Telu | Telugu | Telu | Dravidian |
tgk_Cyrl | Tajik | Cyrl | Iranian |
tgl_Latn | Tagalog | Latn | Austronesian |
tha_Thai | Thai | Thai | Kra-Dai |
tir_Ethi | Tigrinya | Ethi | Afro-Asiatic |
tsn_Latn | Tswana | Latn | Atlantic-Congo |
tso_Latn | Tsonga | Latn | Atlantic-Congo |
tur_Latn | Turkish | Latn | Turkic |
ukr_Cyrl | Ukrainian | Cyrl | Balto-Slavic |
urd_Arab | Urdu | Arab | Indo-Aryan |
urd_Latn^ | Urdu (Romanized) | Latn | Indo-Aryan |
uzn_Latn | Northern Uzbek | Latn | Turkic |
vie_Latn | Vietnamese | Latn | Austroasiatic |
war_Latn | Waray | Latn | Austronesian |
wol_Latn | Wolof | Latn | Atlantic-Congo |
xho_Latn | Xhosa | Latn | Atlantic-Congo |
yor_Latn | Yoruba | Latn | Atlantic-Congo |
zho_Hans | Chinese (Simplified) | Hans | Sino-Tibetan |
zho_Hant | Chinese (Traditional) | Hant | Sino-Tibetan |
zsm_Latn | Standard Malay | Latn | Austronesian |
zul_Latn | Zulu | Latn | Atlantic-Congo |
`^` denotes a language variant not in FLORES-200
The Belebele dataset is licensed under the license found in the LICENSE_CC-BY-SA4.0 file in the root directory of this source tree.
The training set and assembly code are, however, licensed differently. The majority of the training set (data and code) is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: NLTK is licensed under the Apache 2.0 license; pandas and NumPy are licensed under the BSD 3-Clause License.
If you use this data in your work, please cite:
```bibtex
@inproceedings{bandarkar-etal-2024-belebele,
    title = "The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants",
    author = "Bandarkar, Lucas and
      Liang, Davis and
      Muller, Benjamin and
      Artetxe, Mikel and
      Shukla, Satya Narayan and
      Husa, Donald and
      Goyal, Naman and
      Krishnan, Abhinandan and
      Zettlemoyer, Luke and
      Khabsa, Madian",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.44",
    pages = "749--775",
}
```