SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for UDHR #82

Closed SamuelCahyawijaya closed 9 months ago

SamuelCahyawijaya commented 10 months ago

Dataloader name: udhr/udhr.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?udhr

Dataset: udhr
Description: The Universal Declaration of Human Rights (UDHR) is a milestone document in the history of human rights. Drafted by representatives with different legal and cultural backgrounds from all regions of the world, it set out, for the first time, fundamental human rights to be universally protected. The Declaration was adopted by the UN General Assembly in Paris on 10 December 1948 during its 183rd plenary meeting.
Subsets: ind, ilo, mnw, tet, pam, lus, vie, min, lao, tgl, hni, ceb, jav, shn, bcl, hil, sun, ban, zlm, cnh, kkh, cfm, ctd, duu, tdt, tha, bug, mad, mya, khm, war, ace, hnj, blt, hlt
Languages: ind, ilo, mnw, tet, pam, lus, vie, min, lao, tgl, hni, ceb, jav, shn, bcl, hil, sun, ban, zlm, cnh, kkh, cfm, ctd, duu, tdt, tha, bug, mad, mya, khm, war, ace, hnj, blt, hlt
Tasks: Language Modeling
License: Unknown (unknown)
Homepage: https://huggingface.co/datasets/udhr?row=1
HF URL: https://huggingface.co/datasets/udhr?row=1
Paper URL: https://unicode.org/udhr/translations.html

SamuelCahyawijaya commented 10 months ago

For this dataset, please split the data into multiple subsets, one per language, with a single document in each subset.
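The request above could be sketched roughly as follows. This is a minimal illustration, not the actual loader: `SubsetConfig`, `build_configs`, `generate_examples`, the `udhr_{lang}_source` naming pattern, and the `lang`/`text` record fields are all hypothetical stand-ins, and the records are placeholders rather than real UDHR text.

```python
from dataclasses import dataclass

# A few of the language codes listed in the issue, for illustration only.
LANGUAGES = ["ind", "jav", "sun"]


@dataclass(frozen=True)
class SubsetConfig:
    """Hypothetical stand-in for a loader config; not the SEACrowd class."""
    name: str
    language: str


def build_configs(languages):
    # One config (subset) per language; the naming pattern is an assumption.
    return [SubsetConfig(name=f"udhr_{lang}_source", language=lang) for lang in languages]


def generate_examples(records, language):
    """Yield at most one (key, example) pair holding the whole document for `language`."""
    texts = [r["text"] for r in records if r["lang"] == language]
    if texts:
        # Join every matching record so the subset contains a single document.
        yield 0, {"id": "0", "text": "\n".join(texts)}


# Placeholder records; the field names are assumed, not taken from the dataset.
records = [
    {"lang": "ind", "text": "placeholder paragraph A"},
    {"lang": "ind", "text": "placeholder paragraph B"},
    {"lang": "jav", "text": "placeholder paragraph C"},
]

configs = build_configs(LANGUAGES)
ind_examples = list(generate_examples(records, "ind"))
```

Joining the matching records keeps each subset down to one document, which is what the request asks for; a language with no matching records simply yields nothing.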

IvanHalimP commented 10 months ago

self-assign

IvanHalimP commented 10 months ago

Hi, I'd like to report that the following language codes:

"abs", "cja", "fil", "iba", "dbj"

are nowhere to be found in the dataset.
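A check like the one reported above can be sketched as a membership test. The helper name `find_missing_codes` and both input lists are illustrative assumptions; in a real loader, `available` would be read from the dataset itself rather than hard-coded.

```python
def find_missing_codes(requested, available):
    """Return the requested codes that are absent from `available`, in request order."""
    available_set = set(available)
    return [code for code in requested if code not in available_set]


# Illustrative inputs only; these are not the dataset's actual contents.
requested = ["ind", "abs", "fil", "jav"]
available = ["ind", "jav", "sun"]

missing = find_missing_codes(requested, available)
print(missing)  # ['abs', 'fil']
```

Keeping the result in request order makes it easy to paste the missing codes straight into a report like the one above.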