SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Multilingual-ALPACA #531

SamuelCahyawijaya closed 4 months ago

SamuelCahyawijaya commented 5 months ago

Dataloader name: multilingual_alpaca/multilingual_alpaca.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?multilingual_alpaca

Dataset multilingual_alpaca
Description For multilingual general task instruction data, we incorporate the ALPACA dataset (Taori et al., 2023), which consists of 52k English questions and their corresponding responses, and we obtain its foreign-language versions with an in-house translation engine. The six languages are Arabic (Ar), Greek (El), Hindi (Hi), Turkish (Tr), Vietnamese (Vi), and Chinese (Zh).
Subsets -
Languages vie
Tasks Chatbot
License Unknown (unknown)
Homepage https://github.com/NJUNLP/x-LLM
HF URL -
Paper URL https://arxiv.org/pdf/2308.04948
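
For whoever picks this up, here is a minimal sketch of the record mapping a Chatbot-task loader might need. It assumes the Vietnamese files keep the `instruction` / `input` / `output` fields of the original Stanford Alpaca release, which should be checked against the x-LLM repository. The helper name `alpaca_record_to_pair` and the output keys are hypothetical and not part of the SEACrowd schema.

```python
from typing import Dict


def alpaca_record_to_pair(record: Dict[str, str]) -> Dict[str, str]:
    """Map one Alpaca-style record to a prompt/response pair.

    Assumption: the record has "instruction", optional "input", and "output"
    keys, as in the original English Alpaca data. The translated files may differ.
    """
    instruction = record["instruction"].strip()
    context = record.get("input", "").strip()
    # Append the optional context on a new line only when it is non-empty.
    prompt = f"{instruction}\n{context}" if context else instruction
    return {"input": prompt, "output": record["output"].strip()}


# Hypothetical record, for illustration only (not taken from the dataset).
example = {
    "instruction": "Translate to English.",
    "input": "Xin chào",
    "output": "Hello",
}
pair = alpaca_record_to_pair(example)
```

In an actual dataloader this mapping would sit inside the example generator, with the key names adjusted to whatever schema the datahub expects for this task.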

akhdanfadh commented 5 months ago

self-assign