SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #180 | Implement `IndoMMLU` dataloader #324

Closed. chenxwh closed this 8 months ago

chenxwh commented 9 months ago

Closes #180.

Found 2 invalid examples in the original datasets: the answer is not in the options. The implemented dataloader skips those 2 examples.

(4223)
{ 'question': 'Keberadaan jumlah belalang warna hijau dan coklat di sawah pada awalnya seimbang. Setelah padi dipanen, ternyata berpengaruh pada jumlah belalang berwarna hijau yang jumlahnya semakin menurun. Hal ini disebabkan karena ….',
 'options': "['A. belalang berwarna hijau tidak dapat berkembang biak', 'B. jumlahnya berkurang karena dimakan oleh pemangsa', 'C. tidak dapat beradaptasi dengan lingkungan sawah', 'D. adanya pengaruh predasi antara sesama belalang']",
 'answer': 'E'}

(14150)
{'question': 'Ibu Meminta Nana menyapu halaman, tetapi ia sedang belajar, kalimat penolakan yang tetap dalah.....',
 'options': "['A. Aku tidak mau ibu', 'B. Maaf ibu aku sedang belajar', 'C. Ibu saja yang menyapu']",
 'answer': 'D'}
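A minimal sketch of the kind of check that catches such records, assuming `options` is a stringified Python list whose items start with a letter label followed by a period, as in the two examples above. The helper name `is_valid_example` is hypothetical and is not necessarily how the dataloader in this PR implements the skip.

```python
import ast


def is_valid_example(example):
    """Return True if the answer letter matches one of the option labels."""
    # 'options' is stored as a stringified list, e.g. "['A. ...', 'B. ...']",
    # so parse it safely into a real list first.
    options = ast.literal_eval(example["options"])
    # The label is the text before the first period, e.g. "A" in "A. Aku ...".
    labels = {option.split(".", 1)[0].strip() for option in options}
    return example["answer"].strip() in labels


# Record 14150 from above: the answer 'D' has no matching option.
record = {
    "question": "Ibu Meminta Nana menyapu halaman, tetapi ia sedang belajar, kalimat penolakan yang tetap dalah.....",
    "options": "['A. Aku tidak mau ibu', 'B. Maaf ibu aku sedang belajar', 'C. Ibu saja yang menyapu']",
    "answer": "D",
}
print(is_valid_example(record))  # False
```

A loader could call such a check inside its example generator and `continue` past any record for which it returns `False`.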


chenxwh commented 8 months ago

Thanks for the suggestions! I have modified the dataloader accordingly and added the extra languages that have an ISO 639-3 code listed at https://iso639-3.sil.org/code_tables/639/data

chenxwh commented 8 months ago

Thank you for the comments. I have updated accordingly, apart from adding "min", as there is no "Minangkabau language" but only "Minangkabau culture".

holylovenia commented 8 months ago

Replacing @gentaiscool with @ryanignatius due to no response.

chenxwh commented 8 months ago

Oh really? There is no such "subset distinction" in the source datasets. I thought @holylovenia suggested we add those just for the seacrowd schema?

holylovenia commented 8 months ago

> Oh really? There is no such "subset distinction" in the source datasets. I thought @holylovenia suggested we add those just for the seacrowd schema?

I think @ryanignatius suggested it so that all the subset IDs (i.e., `indommlu` and `indommlu_{lang}`) can pass the unit test: `python -m tests.test_seacrowd seacrowd/sea_datasets/indommlu/indommlu.py --subset_id={subset_id}`. That being said, I think having `indommlu_source` as the only source schema is sufficient, since all the subsets follow the same schema.
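To make the subset naming concrete, here is a rough sketch that expands the `indommlu` and `indommlu_{lang}` pattern into one unit-test command per subset ID. The helper names and the language code passed in are assumptions for illustration; the actual list of language codes lives in the dataloader and is not reproduced here.

```python
import shlex

# Path of the dataloader under test, as given in the command above.
DATALOADER = "seacrowd/sea_datasets/indommlu/indommlu.py"


def subset_ids(langs):
    """Return the base subset ID plus one `indommlu_{lang}` ID per language."""
    return ["indommlu"] + [f"indommlu_{lang}" for lang in langs]


def build_test_commands(langs):
    """Build one unit-test command line per subset ID (hypothetical helper)."""
    return [
        shlex.join(
            ["python", "-m", "tests.test_seacrowd", DATALOADER, f"--subset_id={sid}"]
        )
        for sid in subset_ids(langs)
    ]


# "ind" is only an illustrative language code; substitute the codes the dataloader defines.
for command in build_test_commands(["ind"]):
    print(command)
```

The sketch only prints the commands; each one still has to be run from the repository root.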

ryanignatius commented 8 months ago

> Oh really? There is no such "subset distinction" in the source datasets. I thought @holylovenia suggested we add those just for the seacrowd schema?
>
> I think @ryanignatius suggested it so that all the subset IDs (i.e., `indommlu` and `indommlu_{lang}`) can pass the unit test: `python -m tests.test_seacrowd seacrowd/sea_datasets/indommlu/indommlu.py --subset_id={subset_id}`. That being said, I think having `indommlu_source` as the only source schema is sufficient, since all the subsets follow the same schema.

Ah okay, got it. Sorry, I missed the comment about adding them for the seacrowd schema only.