Closed raileymontalan closed 5 months ago
Yeah @MJonibek, I agree. Right now the available subsets are indeed confusing.
['chatgpt_malaysian_open_qa_common_crawl_qa_source', 'chatgpt_malaysian_open_qa_hansard_qa_source', 'chatgpt_malaysian_open_qa_wikipedia_qa_source', 'chatgpt_malaysian_open_qa_common_crawl_qa_seacrowd_qa', 'chatgpt_malaysian_open_qa_hansard_qa_seacrowd_qa', 'chatgpt_malaysian_open_qa_wikipedia_qa_seacrowd_qa']
I've checked the data source and I see what @raileymontalan is doing. Perhaps the separate common crawl, hansard and wikipedia splits should be merged.
Hi @yongzx and @MJonibek, I've combined the subsets in my latest commit. Let me know if you have any concerns after reviewing it again. Thanks!
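For readers following the thread, here is a minimal sketch of what combining the three per-source subsets into one stream of examples could look like. It is not the code from the commit: the `merge_sources` helper, the added `source` field, and the `question`/`answer` keys in the toy data are all assumptions made for illustration. Only the three source names come from the subset list above.

```python
from typing import Dict, Iterator, List, Tuple

# Source names taken from the subset list quoted earlier in this thread.
SOURCES = ["common_crawl_qa", "hansard_qa", "wikipedia_qa"]


def merge_sources(records_by_source: Dict[str, List[dict]]) -> Iterator[Tuple[int, dict]]:
    """Yield (key, example) pairs across all sources using one running key.

    Hypothetical helper: each example is tagged with the source it came
    from, so the per-source information is not lost after merging.
    """
    idx = 0
    for source in SOURCES:
        for record in records_by_source.get(source, []):
            yield idx, {**record, "source": source}
            idx += 1


# Toy data with assumed field names, only to show the shape of the output.
toy = {
    "hansard_qa": [{"question": "q1", "answer": "a1"}],
    "wikipedia_qa": [{"question": "q2", "answer": "a2"}],
}
merged = list(merge_sources(toy))
```

With a single merged stream, one source config and one seacrowd config would be enough, instead of one pair per data source.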
Closes #532
Checkbox
- Dataloader script `seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py` (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its `__init__.py` within `{my_dataset}` folder.
- Values provided for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_LOCAL`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- `_info()`, `_split_generators()` and `_generate_examples()` implemented in dataloader script.
- `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- Dataloader script works with the `datasets.load_dataset` function.
- Dataloader script passes the tests run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py` or `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}`.
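The `--subset_id` value in the test command is the config name without its `_source` or `_seacrowd_*` suffix. A small sketch of deriving it is shown here. The `subset_id` and `build_test_command` helpers are hypothetical, and the suffix pattern is an assumption based only on the config names quoted in this thread (`..._source`, `..._seacrowd_qa`). The dataset folder name in the usage lines is likewise assumed from the shared prefix of those names.

```python
import re

# Assumed suffix pattern: "_source" or "_seacrowd_<schema>" at the end of the name.
_SCHEMA_SUFFIX = re.compile(r"_(source|seacrowd_[a-z0-9_]+)$")


def subset_id(config_name: str) -> str:
    """Strip the schema suffix from a config name (hypothetical helper)."""
    return _SCHEMA_SUFFIX.sub("", config_name)


def build_test_command(dataset: str, config_name: str) -> str:
    """Build the test command string from the checklist for one subset."""
    script = f"seacrowd/sea_datasets/{dataset}/{dataset}.py"
    return f"python -m tests.test_seacrowd {script} --subset_id {subset_id(config_name)}"


# Usage with a config name from this thread; the folder name is an assumption.
name = "chatgpt_malaysian_open_qa_hansard_qa_seacrowd_qa"
command = build_test_command("chatgpt_malaysian_open_qa", name)
```

Both the source and the seacrowd config of a subset map to the same `--subset_id`, so one test run can cover the pair.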