SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.

Apache License 2.0


Create dataset loader for Onto4All #536

Open SamuelCahyawijaya opened 6 months ago

SamuelCahyawijaya commented 6 months ago

Dataloader name: onto4all/onto4all.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?onto4all

Dataset	onto4all
Description	Onto4All is a subsample of other performant open-source conversational datasets. We start with a carefully curated subset of the OpenHermes-2.5-Viet dataset, co-created by @qnguyen3 and @teknium. This dataset is specifically designed to support the training and evaluation of multilingual language models, such as Vistral-7B-chat and VinaLlama-7B-chat, and is derived from our Supervised Fine-Tuning (SFT) data. We have included Vietnamese here, but will add more languages.
Subsets	-
Languages	vie
Tasks	Question Answering
License	Creative Commons Zero v1.0 Universal (cc0-1.0)
Homepage	https://huggingface.co/datasets/ontocord/onto4all
HF URL	https://huggingface.co/datasets/ontocord/onto4all
Paper URL	https://huggingface.co/datasets/ontocord/onto4all
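
For whoever picks this up, the card fields above could be captured in a small metadata record before writing the loader. The following is only a plain-Python sketch: the `DatasetCard` class, its field names, and the `loader_path` helper are hypothetical and are not the SEACrowd config API. The values are copied from the table above, and the path format follows the "Dataloader name" line in this issue.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical record mirroring the card fields in this issue.
# This is NOT the SEACrowd config API, only an illustration.
@dataclass(frozen=True)
class DatasetCard:
    name: str
    languages: List[str]
    tasks: List[str]
    license: str
    homepage: str
    # The card lists "-" for subsets, so default to an empty list.
    subsets: List[str] = field(default_factory=list)

    def loader_path(self) -> str:
        # Follows the "Dataloader name" pattern used here: <name>/<name>.py
        return f"{self.name}/{self.name}.py"


# Values copied from the dataset card above.
ONTO4ALL = DatasetCard(
    name="onto4all",
    languages=["vie"],
    tasks=["Question Answering"],
    license="cc0-1.0",
    homepage="https://huggingface.co/datasets/ontocord/onto4all",
)

print(ONTO4ALL.loader_path())  # prints "onto4all/onto4all.py"
```

Keeping the card in one record like this makes it easy to check the loader's declared languages, tasks, and license against the issue before opening a PR.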

bp-high commented 6 months ago

self-assign

patrickamadeus commented 6 months ago

self-assign