Closes #532 | Add Dataloader chatgpt_malaysian_open_qa

raileymontalan commented 6 months ago

Closes #532

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
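
The BUILDER_CONFIGS item in the checklist above can be sanity-checked with a few lines of plain Python. This is a minimal sketch only: `ConfigStub` is a hypothetical stand-in for `SEACrowdConfig`, and the `schema` values (`"source"`, `"seacrowd_qa"`) are assumptions inferred from the config-name suffixes in this PR, not taken from the actual script.

```python
from dataclasses import dataclass


# Hypothetical stand-in for SEACrowdConfig, so the sketch runs without the
# seacrowd package installed. The field names here are assumptions.
@dataclass
class ConfigStub:
    name: str
    schema: str
    subset_id: str


_DATASETNAME = "chatgpt_malaysian_open_qa"

# One config for the source schema and one for a seacrowd schema.
BUILDER_CONFIGS = [
    ConfigStub(name=f"{_DATASETNAME}_source", schema="source", subset_id=_DATASETNAME),
    ConfigStub(name=f"{_DATASETNAME}_seacrowd_qa", schema="seacrowd_qa", subset_id=_DATASETNAME),
]


def has_required_schemas(configs):
    """Return True if configs is a list with a source and a seacrowd_* schema."""
    if not isinstance(configs, list):
        return False
    schemas = [c.schema for c in configs]
    has_source = "source" in schemas
    has_seacrowd = any(s.startswith("seacrowd_") for s in schemas)
    return has_source and has_seacrowd
```

Calling `has_required_schemas(BUILDER_CONFIGS)` on the list above returns `True`; dropping either entry makes it return `False`.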

yongzx commented 5 months ago

Yea @MJonibek I agree. Right now the available subsets are indeed confusing.

['chatgpt_malaysian_open_qa_common_crawl_qa_source', 'chatgpt_malaysian_open_qa_hansard_qa_source', 'chatgpt_malaysian_open_qa_wikipedia_qa_source', 'chatgpt_malaysian_open_qa_common_crawl_qa_seacrowd_qa', 'chatgpt_malaysian_open_qa_hansard_qa_seacrowd_qa', 'chatgpt_malaysian_open_qa_wikipedia_qa_seacrowd_qa']

I've checked the data source and I see what @raileymontalan is doing. Perhaps the split into Common Crawl, Hansard, and Wikipedia subsets should be merged.

raileymontalan commented 5 months ago

Hi @yongzx and @MJonibek, I've combined the subsets in my latest commit. Let me know if you have any concerns after reviewing it again. Thanks!
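
For readers following the thread, the merge discussed above could look roughly like the sketch below. It is an illustration under stated assumptions, not necessarily what the commit does: the merged config names, the helper function names, and the running-index keys are all hypothetical, and only the six per-subset names come from the list quoted earlier.

```python
from itertools import chain

_DATASETNAME = "chatgpt_malaysian_open_qa"
# Subset names taken from the config list quoted in the review comment.
_SUBSETS = ["common_crawl_qa", "hansard_qa", "wikipedia_qa"]
_SUFFIXES = ("source", "seacrowd_qa")


def per_subset_config_names():
    """Config names before the merge: one per (schema suffix, subset) pair."""
    return [f"{_DATASETNAME}_{s}_{suffix}" for suffix in _SUFFIXES for s in _SUBSETS]


def merged_config_names():
    """Config names after the merge (assumed): one per schema suffix."""
    return [f"{_DATASETNAME}_{suffix}" for suffix in _SUFFIXES]


def generate_merged_examples(rows_by_subset):
    """Chain the rows of every subset and yield (key, row) with a unique running key.

    rows_by_subset is a hypothetical dict mapping each subset name to an
    iterable of already-parsed rows.
    """
    merged = chain.from_iterable(rows_by_subset[s] for s in _SUBSETS)
    for idx, row in enumerate(merged):
        yield idx, row
```

Using a single running index keeps the example keys unique across the three sources, which matters once they are served from one config.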

SEACrowd / seacrowd-datahub

Closes #532 | Add Dataloader chatgpt_malaysian_open_qa #545