SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Onto4All #536

Open SamuelCahyawijaya opened 6 months ago

SamuelCahyawijaya commented 6 months ago

Dataloader name: onto4all/onto4all.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?onto4all

Dataset: onto4all
Description: Onto4All is a subsample of other open-source, performant conversational datasets. We start with a carefully curated subset of the OpenHermes-2.5-Viet dataset, co-created by @qnguyen3 and @teknium. This dataset is specifically designed to support the training and evaluation of multilingual language models, such as Vistral-7B-chat and VinaLlama-7B-chat, and is derived from our Supervised Fine-Tuning (SFT) data. We have included Vietnamese here, but will add more languages.
Subsets: -
Languages: vie
Tasks: Question Answering
License: Creative Commons Zero v1.0 Universal (cc0-1.0)
Homepage: https://huggingface.co/datasets/ontocord/onto4all
HF URL: https://huggingface.co/datasets/ontocord/onto4all
Paper URL: https://huggingface.co/datasets/ontocord/onto4all
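For whoever picks this up, the metadata above could be gathered in one place before writing the loader. The following is only a rough sketch in plain Python: the `DatasetCard` class and its `loader_path` helper are hypothetical names made up for illustration, not part of the seacrowd-datahub codebase, and the values are copied from the fields listed in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetCard:
    """Hypothetical container for the metadata listed in this issue."""

    name: str
    languages: tuple
    tasks: tuple
    license: str
    homepage: str
    hf_repo: str

    def loader_path(self) -> str:
        # Mirrors the "Dataloader name" pattern from the issue: <name>/<name>.py
        return f"{self.name}/{self.name}.py"


# Values taken from the issue fields; nothing here is fetched from the network.
ONTO4ALL = DatasetCard(
    name="onto4all",
    languages=("vie",),
    tasks=("Question Answering",),
    license="cc0-1.0",
    homepage="https://huggingface.co/datasets/ontocord/onto4all",
    hf_repo="ontocord/onto4all",
)

print(ONTO4ALL.loader_path())
```

If the Hugging Face `datasets` package is installed, `datasets.load_dataset(ONTO4ALL.hf_repo)` should be the starting point for inspecting the data. The split and column names are not stated in this issue, so they would need to be checked on the dataset page first.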

bp-high commented 6 months ago

self-assign

patrickamadeus commented 6 months ago

self-assign