SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for UDHR-LID #90

Closed by SamuelCahyawijaya 9 months ago

SamuelCahyawijaya commented 10 months ago

Dataloader name: udhr_lid/udhr_lid.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?udhr_lid

Dataset: udhr_lid
Description: The UDHR-LID dataset is a refined version of the Universal Declaration of Human Rights, tailored for language identification tasks. It removes filler texts, repeated phrases, and inaccuracies from the original UDHR, focusing only on cleaned paragraphs. Each entry in the dataset is associated with a specific language, providing long, linguistically rich content. This dataset is particularly useful for non-parallel, language-specific text analysis in natural language processing.
Subsets: sun, ace, mad, lao, cfm, hnj, min, zlm, tha, blt, hni, jav, tdt, cnh, khm, ban, ind, mya, ccp, duu, tet, kkh, bug, vie
Languages: sun, ace, mad, lao, cfm, hnj, min, zlm, tha, blt, hni, jav, tdt, cnh, khm, ban, ind, mya, ccp, duu, tet, kkh, bug, vie
Tasks: Language Identification
License: Creative Commons Zero v1.0 Universal (cc0-1.0)
Homepage: https://huggingface.co/datasets/cis-lmu/udhr-lid
HF URL: https://huggingface.co/datasets/cis-lmu/udhr-lid
Paper URL: https://arxiv.org/abs/2310.16248
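
Since the upstream dataset covers many more languages than the 24 subsets listed here, a loader would probably need to keep only the SEA rows and group them per subset. Below is a minimal sketch of that filtering step, not the actual SEACrowd loader. The field name `iso639-3` is an assumption about the upstream schema, and `filter_sea_rows` and `group_by_subset` are hypothetical helper names introduced for illustration.

```python
from typing import Dict, Iterable, Iterator, List

# Subset codes copied from the "Subsets" row of this issue.
SEA_SUBSETS = frozenset(
    "sun ace mad lao cfm hnj min zlm tha blt hni jav tdt cnh khm ban "
    "ind mya ccp duu tet kkh bug vie".split()
)

# Assumption: each record stores its ISO 639-3 code under this key.
# Adjust it if the upstream schema uses a different column name.
LANG_FIELD = "iso639-3"


def filter_sea_rows(rows: Iterable[dict], lang_field: str = LANG_FIELD) -> Iterator[dict]:
    """Yield only the rows whose language code is one of the SEA subsets."""
    for row in rows:
        if row.get(lang_field) in SEA_SUBSETS:
            yield row


def group_by_subset(rows: Iterable[dict], lang_field: str = LANG_FIELD) -> Dict[str, List[dict]]:
    """Group the SEA rows by language code, one list of rows per subset."""
    groups: Dict[str, List[dict]] = {}
    for row in filter_sea_rows(rows, lang_field):
        groups.setdefault(row[lang_field], []).append(row)
    return groups
```

The helpers accept any iterable of dict-like records, so they can be tried on a few hand-written rows before being wired to the rows downloaded from the HF URL above.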

rmahendra commented 10 months ago

self-assign

muhsatrio commented 10 months ago

self-assign