SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.

Apache License 2.0


Create dataset loader for UDHR-LID #90

Closed by SamuelCahyawijaya 9 months ago

SamuelCahyawijaya commented 10 months ago

Dataloader name: udhr_lid/udhr_lid.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?udhr_lid

Dataset	udhr_lid
Description	The UDHR-LID dataset is a refined version of the Universal Declaration of Human Rights, tailored for language identification tasks. It removes filler texts, repeated phrases, and inaccuracies from the original UDHR, focusing only on cleaned paragraphs. Each entry in the dataset is associated with a specific language, providing long, linguistically rich content. This dataset is particularly useful for non-parallel, language-specific text analysis in natural language processing.
Subsets	sun, ace, mad, lao, cfm, hnj, min, zlm, tha, blt, hni, jav, tdt, cnh, khm, ban, ind, mya, ccp, duu, tet, kkh, bug, vie
Languages	sun, ace, mad, lao, cfm, hnj, min, zlm, tha, blt, hni, jav, tdt, cnh, khm, ban, ind, mya, ccp, duu, tet, kkh, bug, vie
Tasks	Language Identification
License	Creative Commons Zero v1.0 Universal (cc0-1.0)
Homepage	https://huggingface.co/datasets/cis-lmu/udhr-lid
HF URL	https://huggingface.co/datasets/cis-lmu/udhr-lid
Paper URL	https://arxiv.org/abs/2310.16248
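
As a starting point for the loader, here is a minimal sketch of how rows might be grouped into the per-language subsets listed above. It is not the SEACrowd implementation. The column names `iso639-3` and `sentence` are assumptions about the upstream dataset schema and should be checked against the HF dataset card. The function name `split_by_subset` and the output keys `text` and `label` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping

# ISO 639-3 codes copied from the "Subsets" row of this issue.
SEA_LANGUAGES = [
    "sun", "ace", "mad", "lao", "cfm", "hnj", "min", "zlm",
    "tha", "blt", "hni", "jav", "tdt", "cnh", "khm", "ban",
    "ind", "mya", "ccp", "duu", "tet", "kkh", "bug", "vie",
]

# ASSUMPTION: these column names are not stated in the issue;
# verify them against the actual dataset before relying on them.
LANG_FIELD = "iso639-3"
TEXT_FIELD = "sentence"


def split_by_subset(
    rows: Iterable[Mapping[str, str]],
    languages: Iterable[str] = SEA_LANGUAGES,
) -> Dict[str, List[Dict[str, str]]]:
    """Group rows by language code, keeping only the requested languages.

    Each kept row becomes a {"text", "label"} dict (hypothetical output
    shape for a language identification example).
    """
    wanted = set(languages)
    subsets: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for row in rows:
        code = row.get(LANG_FIELD)
        if code in wanted:
            subsets[code].append({"text": row[TEXT_FIELD], "label": code})
    return dict(subsets)
```

The function accepts any iterable of dict-like rows, so it can be tried on a few hand-written rows before being wired to the real data source. Rows whose language code is outside the list are dropped, which matches the idea of keeping only the SEA subsets from a larger multilingual dataset.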

rmahendra commented 10 months ago

self-assign

muhsatrio commented 10 months ago

self-assign