SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0
64 stars 57 forks source link

Create dataset loader for OpenAssistant #613

Open SamuelCahyawijaya opened 5 months ago

SamuelCahyawijaya commented 5 months ago

Dataloader name: oasst2/oasst2.py DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?oasst2

Dataset oasst2
Description In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST), a human-generated, human-annotated assistant-style conversation corpus. This dataset contains message trees. Each message tree has an initial prompt message as the root node, which can have multiple child messages as replies, and these child messages can have multiple replies.
Subsets oasst1, oasst2
Languages tha, vie, ind
Tasks Chatbot
License Apache license 2.0 (apache-2.0)
Homepage https://github.com/LAION-AI/Open-Assistant
HF URL https://huggingface.co/datasets/OpenAssistant/oasst2
Paper URL -