SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #532 | Add Dataloader chatgpt_malaysian_open_qa #545

Closed: raileymontalan closed this 5 months ago

raileymontalan commented 6 months ago

Closes #532


yongzx commented 5 months ago

Yeah @MJonibek, I agree. The subsets currently available are indeed confusing:

['chatgpt_malaysian_open_qa_common_crawl_qa_source', 'chatgpt_malaysian_open_qa_hansard_qa_source', 'chatgpt_malaysian_open_qa_wikipedia_qa_source', 'chatgpt_malaysian_open_qa_common_crawl_qa_seacrowd_qa', 'chatgpt_malaysian_open_qa_hansard_qa_seacrowd_qa', 'chatgpt_malaysian_open_qa_wikipedia_qa_seacrowd_qa']

I've checked the data source and I see what @raileymontalan is doing. Perhaps the separate common crawl, hansard, and wikipedia subsets should be merged into one.

raileymontalan commented 5 months ago

Hi @yongzx and @MJonibek, I've combined the subsets in my latest commit. Let me know if you have any concerns after reviewing it again. Thanks!
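
For illustration, the merge discussed in this thread can be sketched as below. This is a minimal sketch and not the code from the commit: the merged config names and the helper `merged_config_name` are hypothetical, and it assumes each per-source name follows the pattern `<dataset>_<source>_<schema>` seen in the list above.

```python
# Hypothetical sketch: collapse the six per-source config names into
# one config per schema. The merged names are an assumption, not taken
# from the actual dataloader.
DATASET_NAME = "chatgpt_malaysian_open_qa"
SOURCES = ["common_crawl_qa", "hansard_qa", "wikipedia_qa"]
SCHEMAS = ["source", "seacrowd_qa"]

# Rebuild the six config names quoted in the thread.
per_source_configs = [
    f"{DATASET_NAME}_{src}_{schema}" for schema in SCHEMAS for src in SOURCES
]


def merged_config_name(config_name: str) -> str:
    """Drop the per-source segment from a config name, if one is present."""
    suffix = config_name[len(DATASET_NAME) + 1:]
    for src in SOURCES:
        if suffix.startswith(src + "_"):
            return f"{DATASET_NAME}_{suffix[len(src) + 1:]}"
    # Name has no per-source segment, so it is already merged.
    return config_name


# De-duplicate: three sources per schema map to a single merged name.
merged_configs = sorted({merged_config_name(name) for name in per_source_configs})
print(merged_configs)
```

Under these assumptions the six names reduce to two, one per schema, which is the kind of simplification suggested above.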