SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for CulturaY #535

Status: Closed (SamuelCahyawijaya closed this 3 months ago)

SamuelCahyawijaya commented 5 months ago

Dataloader name: culturay/culturay.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?culturay

Dataset: culturay
Description: CulturaY: A Large Cleaned Multilingual Dataset of 75 Languages. From the team that brought you CulturaX, we present CulturaY, another substantial multilingual dataset of 15TB (uncompressed)/3TB (zstd-compressed) that applies the same dataset cleaning methodology to the HPLT v1.1 dataset. Please note that HPLT v1.2 has also been released and is an alternative version with different cleaning methodologies. This data was used in part to train our SOTA Vietnamese model: Vistral-7B-Chat.
Subsets: -
Languages: mya, fil, zlm, vie, ind, tha
Tasks: Language Modeling
License: Creative Commons Attribution 4.0 (cc-by-4.0)
Homepage: https://huggingface.co/datasets/ontocord/CulturaY
HF URL: https://huggingface.co/datasets/ontocord/CulturaY
Paper URL: https://huggingface.co/datasets/ontocord/CulturaY
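
For whoever picks this up, a rough sketch of how the six listed languages might map to per-language loader configs. The naming pattern (`<dataset>_<lang>_source` and `<dataset>_<lang>_seacrowd_<schema>`) and the `ssp` schema label for language modeling are assumptions about the datahub's conventions, and `config_names` is a hypothetical helper, not part of any existing loader. Please check the contributing guide before relying on it.

```python
# ISO 639-3 codes taken from the "Languages" field of this issue.
SEA_LANGUAGES = ["mya", "fil", "zlm", "vie", "ind", "tha"]

# Taken from the dataloader path culturay/culturay.py.
DATASET_NAME = "culturay"


def config_names(languages=SEA_LANGUAGES, schema="ssp"):
    """Build hypothetical config names: one source and one unified-schema
    config per language. The naming pattern is an assumption."""
    names = []
    for lang in languages:
        names.append(f"{DATASET_NAME}_{lang}_source")
        names.append(f"{DATASET_NAME}_{lang}_seacrowd_{schema}")
    return names


if __name__ == "__main__":
    for name in config_names():
        print(name)
```

Under these assumptions the loader would expose twelve configs, two per language; the actual subset names used by the upstream Hugging Face dataset are not specified here and would need to be confirmed on the dataset page.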
akhdanfadh commented 5 months ago

self-assign