Create dataset loader for Bactrian-X

SamuelCahyawijaya commented 9 months ago

Dataloader name: bactrian_x/bactrian_x.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?bactrian_x

Dataset	bactrian_x
Description	The Bactrian-X dataset is a collection of 3.4M instruction-response pairs in 52 languages, obtained by translating 67K English instructions (alpaca-52k + dolly-15k) into 51 languages using the Google Translate API. The translated instructions are then fed to ChatGPT (gpt-3.5-turbo) to obtain natural responses, yielding 3.4M instruction-response pairs (52 languages x 67K instances = 3.4M instances). Human evaluations of response quality were conducted for several languages; those of interest to SEACrowd are Burmese and Tagalog.
Subsets	Burmese, Tagalog, Indonesian, Khmer, Thai, Vietnamese
Languages	mya, tgl, ind, khm, tha, vie
Tasks	Question Answering, Summarization, Relation Extraction, Text Classification
License	Creative Commons Attribution Non Commercial 4.0 (cc-by-nc-4.0)
Homepage	https://github.com/mbzuai-nlp/Bactrian-X
HF URL	https://huggingface.co/datasets/MBZUAI/Bactrian-X/tree/main/data
Paper URL	https://arxiv.org/pdf/2305.15011.pdf
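
For reference, a minimal sketch of how the six subsets listed above could be keyed in the dataloader. The ISO 639-3 codes come from the Languages row; the two-letter ISO 639-1 counterparts are standard, but the assumption that the upstream data is keyed by those two-letter codes is not confirmed here. `ISO3_TO_ISO1` and `subset_key` are hypothetical names used only for illustration.

```python
# ISO 639-3 codes from the "Languages" row, mapped to ISO 639-1 codes.
# Assumption: the upstream data is keyed by two-letter codes; verify
# against the HF repository before relying on this.
ISO3_TO_ISO1 = {
    "mya": "my",  # Burmese
    "tgl": "tl",  # Tagalog
    "ind": "id",  # Indonesian
    "khm": "km",  # Khmer
    "tha": "th",  # Thai
    "vie": "vi",  # Vietnamese
}


def subset_key(iso3: str) -> str:
    """Return the two-letter key for a supported ISO 639-3 code."""
    try:
        return ISO3_TO_ISO1[iso3]
    except KeyError:
        # Languages outside the six subsets are rejected explicitly.
        raise ValueError(f"Unsupported language code: {iso3!r}") from None
```

Rejecting unknown codes explicitly keeps languages outside the six listed subsets from slipping into the loader unnoticed.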

akhdanfadh commented 8 months ago

Hi, the dataset boils down to instruction fine-tuning. Should I just implement the text2text scheme for this with text_1=instruction+input and text_2=output?

github-actions[bot] commented 8 months ago

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

akhdanfadh commented 8 months ago

Hmm, I think I need to mention you for a faster response: @sabilmakbar @holylovenia

holylovenia commented 7 months ago

> Hi, the dataset boils down to instruction fine-tuning. Should I just implement the text2text scheme for this with text_1=instruction+input and text_2=output?

Hi @akhdanfadh, sorry for the late reply. 🙏 Yes yes, can you use Tasks.INSTRUCTION_TUNING with text_1_name as "instruction" and text_2_name as "response"?
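
The mapping discussed above might look roughly like this. It is only a sketch: it assumes each raw record is a dict with `instruction`, `input`, and `output` keys (as named in the question), and that the text2text schema has `id`, `text_1`, `text_2`, `text_1_name`, and `text_2_name` fields. The helper name `to_t2t_example` and the newline used to join instruction and input are illustrative choices, not something settled in this thread.

```python
def to_t2t_example(record: dict, index: int) -> dict:
    """Convert one raw instruction record into a text2text-style example."""
    instruction = record["instruction"].strip()
    # "input" may be missing or empty for instructions that need no context.
    extra_input = (record.get("input") or "").strip()
    # text_1 = instruction + input, joined by a newline (assumed separator).
    text_1 = f"{instruction}\n{extra_input}" if extra_input else instruction
    return {
        "id": str(record.get("id", index)),
        "text_1": text_1,
        "text_2": record["output"],
        "text_1_name": "instruction",
        "text_2_name": "response",
    }
```

Records with an empty `input` keep only the instruction in `text_1`, so no trailing separator is left behind.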

akhdanfadh commented 7 months ago

eng is listed in the Languages code, I think that is not intended(?)

holylovenia commented 7 months ago

> eng is listed in the Languages code, I think that is not intended(?)

Thanks for notifying me! I've fixed it now. 👍

SEACrowd / seacrowd-datahub

Create dataset loader for Bactrian-X #424