Closes #180 | Implement `IndoMMLU` dataloader

chenxwh commented 9 months ago

Closes #180.

Found 2 invalid examples in the original dataset: in each, the answer key is not among the listed options. The implemented dataloader skips those 2 examples.

(4223)
{ 'question': 'Keberadaan jumlah belalang warna hijau dan coklat di sawah pada awalnya seimbang. Setelah padi dipanen, ternyata berpengaruh pada jumlah belalang berwarna hijau yang jumlahnya semakin menurun. Hal ini disebabkan karena ….',
 'options': "['A. belalang berwarna hijau tidak dapat berkembang biak', 'B. jumlahnya berkurang karena dimakan oleh pemangsa', 'C. tidak dapat beradaptasi dengan lingkungan sawah', 'D. adanya pengaruh predasi antara sesama belalang']",
 'answer': 'E'}

(14150)
{'question': 'Ibu Meminta Nana menyapu halaman, tetapi ia sedang belajar, kalimat penolakan yang tetap dalah.....',
 'options': "['A. Aku tidak mau ibu', 'B. Maaf ibu aku sedang belajar', 'C. Ibu saja yang menyapu']",
 'answer': 'D'}
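A minimal sketch of the skip rule, not the exact code in the dataloader. It assumes `options` is stored as a stringified Python list (as in the two records above) and that each option starts with its letter followed by a period. The helper name `is_valid` and the second sample record are hypothetical, added only for illustration.

```python
import ast


def is_valid(example):
    """Return True when the answer letter matches one of the option labels."""
    # Assumption: 'options' is a stringified list, e.g. "['A. ...', 'B. ...']".
    options = ast.literal_eval(example["options"])
    # Assumption: each option starts with its letter followed by a period.
    labels = {opt.split(".", 1)[0].strip() for opt in options}
    return example["answer"] in labels


_OPTIONS = "['A. Aku tidak mau ibu', 'B. Maaf ibu aku sedang belajar', 'C. Ibu saja yang menyapu']"

examples = [
    # Record 14150 from the source data: answer 'D' is not among A-C.
    {"options": _OPTIONS, "answer": "D"},
    # Hypothetical well-formed record, for contrast only.
    {"options": _OPTIONS, "answer": "B"},
]

# Keep only the examples whose answer key appears in the options.
kept = [ex for ex in examples if is_valid(ex)]
```

With this check, an example whose answer key is missing from the options is dropped instead of producing an out-of-range answer index.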

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
[x] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

chenxwh commented 8 months ago

Thanks for the suggestions! Modified accordingly, and added the extra languages that have an ISO code listed at https://iso639-3.sil.org/code_tables/639/data

chenxwh commented 8 months ago

Thank you for the comments. I have updated accordingly, apart from adding "min", as there is no "Minangkabau language" but only "Minangkabau culture".

holylovenia commented 8 months ago

Replacing @gentaiscool with @ryanignatius due to no response.

chenxwh commented 8 months ago

Oh really? There is no such "subset distinction" in the source datasets. I thought @holylovenia suggested we add those just for the seacrowd schema?

holylovenia commented 8 months ago

> Oh really? There is no such "subset distinction" in the source datasets. I thought @holylovenia suggested we add those just for the seacrowd schema?

I think @ryanignatius suggested it so that all the subset IDs (i.e., "indommlu" and "indommlu_{lang}") can pass the unit test: `python -m tests.test_seacrowd seacrowd/sea_datasets/indommlu/indommlu.py --subset_id={subset_id}`. That being said, I think having indommlu_source as the only source schema is sufficient, since all the subsets follow the same schema.

ryanignatius commented 8 months ago

> > Oh really? There is no such "subset distinction" in the source datasets. I thought @holylovenia suggested we add those just for the seacrowd schema?
>
> I think @ryanignatius suggested it so that all the subset IDs (i.e., "indommlu" and "indommlu_{lang}") can pass the unit test: `python -m tests.test_seacrowd seacrowd/sea_datasets/indommlu/indommlu.py --subset_id={subset_id}`. That being said, I think having indommlu_source as the only source schema is sufficient, since all the subsets follow the same schema.

Ah okay, got it. Sorry, I missed the comment about adding them for the seacrowd schema only.

SEACrowd / seacrowd-datahub

Closes #180 | Implement `IndoMMLU` dataloader #324
