SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Bactrian-X #424

Closed. SamuelCahyawijaya closed this 7 months ago

SamuelCahyawijaya commented 9 months ago

Dataloader name: bactrian_x/bactrian_x.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?bactrian_x

| Dataset | bactrian_x |
| --- | --- |
| Description | The Bactrian-X dataset is a collection of 3.4M instruction-response pairs in 52 languages, obtained by translating 67K English instructions (alpaca-52k + dolly-15k) into 51 languages using the Google Translate API. The translated instructions are then fed to ChatGPT (gpt-3.5-turbo) to obtain its natural responses, resulting in 3.4M instruction-response pairs in 52 languages (52 languages x 67k instances = 3.4M instances). Human evaluations were conducted to assess response quality for several languages; those of interest to SEACrowd are Burmese and Tagalog. |
| Subsets | Burmese, Tagalog, Indonesian, Khmer, Thai, Vietnamese |
| Languages | mya, tgl, ind, khm, tha, vie |
| Tasks | Question Answering, Summarization, Relation Extraction, Text Classification |
| License | Creative Commons Attribution Non Commercial 4.0 (cc-by-nc-4.0) |
| Homepage | https://github.com/mbzuai-nlp/Bactrian-X |
| HF URL | https://huggingface.co/datasets/MBZUAI/Bactrian-X/tree/main/data |
| Paper URL | https://arxiv.org/pdf/2305.15011.pdf |
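The language codes listed for this dataset are ISO 639-3, while the upstream data appears to be organised per language under two-letter ISO 639-1 codes. A minimal sketch of the lookup a loader might need is below. The ISO correspondence itself is standard; the assumption that upstream subset names are exactly these two-letter codes, and the names `SEACROWD_TO_UPSTREAM` and `upstream_code`, are hypothetical and should be checked against the HF URL above.

```python
# ISO 639-3 codes (as listed for this dataset) mapped to ISO 639-1 codes.
# Assumption: the upstream repository names its per-language subsets with
# the two-letter codes; verify against the HF repository before relying on it.
SEACROWD_TO_UPSTREAM = {
    "mya": "my",  # Burmese
    "tgl": "tl",  # Tagalog
    "ind": "id",  # Indonesian
    "khm": "km",  # Khmer
    "tha": "th",  # Thai
    "vie": "vi",  # Vietnamese
}


def upstream_code(lang: str) -> str:
    """Return the two-letter code for a listed language, or raise ValueError."""
    try:
        return SEACROWD_TO_UPSTREAM[lang]
    except KeyError:
        raise ValueError(f"{lang!r} is not one of the listed subsets") from None
```

Failing loudly on an unlisted code (for example `eng`) keeps a subset outside the six listed languages from slipping into the loader unnoticed.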
akhdanfadh commented 8 months ago

Hi, the dataset boils down to instruction fine-tuning. Should I just implement the text2text scheme for this with text_1=instruction+input and text_2=output?

github-actions[bot] commented 8 months ago

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

akhdanfadh commented 8 months ago

Hmm, I think I need to mention you for a faster response: @sabilmakbar @holylovenia

holylovenia commented 7 months ago

> Hi, the dataset boils down to instruction fine-tuning. Should I just implement the text2text scheme for this with text_1=instruction+input and text_2=output?

Hi @akhdanfadh, sorry for the late reply. 🙏 Yes yes, can you use Tasks.INSTRUCTION_TUNING with text_1_name as "instruction" and text_2_name as "response"?
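For anyone following along, the mapping discussed here could look roughly like the sketch below. It is not the merged dataloader. It assumes each upstream record carries `id`, `instruction`, `input`, and `output` fields, and that the text-to-text schema expects the keys `id`, `text_1`, `text_2`, `text_1_name`, and `text_2_name`. The helper name `to_t2t_row` and the choice of a newline to join instruction and input are hypothetical.

```python
from typing import Dict, Optional


def to_t2t_row(record: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Map one instruction-tuning record to a text-to-text style row.

    Assumed input fields: id, instruction, input, output.
    Assumed output keys: id, text_1, text_2, text_1_name, text_2_name.
    """
    instruction = (record.get("instruction") or "").strip()
    extra_input = (record.get("input") or "").strip()
    # text_1 = instruction + input; the input is often empty, so append it
    # only when present. The newline separator is an illustrative choice.
    text_1 = f"{instruction}\n{extra_input}" if extra_input else instruction
    return {
        "id": str(record["id"]),
        "text_1": text_1,
        "text_2": record.get("output") or "",
        # Names suggested in the thread for the instruction-tuning task.
        "text_1_name": "instruction",
        "text_2_name": "response",
    }


example = {
    "id": "example-1",
    "instruction": "Translate to English.",
    "input": "Selamat pagi",
    "output": "Good morning",
}
row = to_t2t_row(example)
```

In a real loader a function like this would be called once per record inside the example generator, with the task declared as `Tasks.INSTRUCTION_TUNING` in the loader's supported tasks.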

akhdanfadh commented 7 months ago

eng is listed in the Languages code, I think that is not intended(?)

holylovenia commented 7 months ago

> eng is listed in the Languages code, I think that is not intended(?)

Thanks for notifying me! I've fixed it now. 👍