SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for HPLTDatasets v1.2 #524

Closed: SamuelCahyawijaya closed this 3 months ago

SamuelCahyawijaya commented 5 months ago

Dataloader name: hplt/hplt.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?hplt

Dataset: hplt
Description: The dataset is part of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. HPLT derives monolingual and bilingual datasets from the Internet Archive and CommonCrawl and builds efficient and solid machine translation (MT) models as well as large language models (LLMs). HPLT aims to provide free, sustainable, and reusable datasets, models, and workflows at scale using high-performance computing (HPC).
Subsets: -
Languages: ind, zlm, tha, mya, fil, vie
Tasks: Language Modeling
License: Creative Commons Zero v1.0 Universal (cc0-1.0)
Homepage: https://hplt-project.org/datasets/v1.2
HF URL: https://huggingface.co/datasets/BramVanroy/hplt_monolingual_v1_2
Paper URL: https://aclanthology.org/2023.eamt-1.61/
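
As a starting point for the dataloader, here is a minimal sketch of how the six languages above might be pulled from the HF URL. It is not the final `hplt/hplt.py`. The two-letter config names in `SEA_TO_HPLT`, and the helper names `hplt_config` and `stream_hplt`, are assumptions for illustration; the actual config names should be checked against the dataset card before use. In particular, `zlm` to `ms` and `fil` to `tl` are approximate matches, not exact code equivalents.

```python
# Hypothetical mapping from the ISO 639-3 codes listed in this issue to the
# two-letter config names ASSUMED for the Hugging Face dataset.
# Verify against the dataset card; these are not confirmed by the issue.
SEA_TO_HPLT = {
    "ind": "id",
    "zlm": "ms",  # approximate: "ms" covers Malay more broadly
    "tha": "th",
    "mya": "my",
    "fil": "tl",  # approximate: "tl" is Tagalog
    "vie": "vi",
}

# Repository id taken from the HF URL in this issue.
HF_REPO = "BramVanroy/hplt_monolingual_v1_2"


def hplt_config(lang: str) -> str:
    """Return the assumed HF config name for an ISO 639-3 code."""
    try:
        return SEA_TO_HPLT[lang]
    except KeyError:
        supported = ", ".join(sorted(SEA_TO_HPLT))
        raise ValueError(
            f"Unsupported language {lang!r}; expected one of: {supported}"
        ) from None


def stream_hplt(lang: str, split: str = "train"):
    """Open one language subset in streaming mode (requires network access)."""
    # Imported lazily so the mapping helpers work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(HF_REPO, hplt_config(lang), split=split, streaming=True)
```

Streaming is used here because the monolingual subsets are large, so iterating over a few records is a cheaper way to inspect the schema than a full download. For example, `next(iter(stream_hplt("ind")))` would fetch a single record, assuming the `id` config and a `train` split exist.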

akhdanfadh commented 5 months ago

self-assign