SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Pangloss Collection #511

Open SamuelCahyawijaya opened 8 months ago

SamuelCahyawijaya commented 8 months ago

Dataloader name: pangloss_collection/pangloss_collection.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?pangloss_collection

Dataset: pangloss_collection
Description: The Pangloss Collection is an open archive of audio recordings of underdocumented languages across the world and their dialects, including languages from Cambodia, Laos, Myanmar and Vietnam. About half of all recordings are transcribed, annotated and translated. Many recordings are readings of vocabulary lists or are narratives about the speakers' lives.
Subsets: khm, cog, pcb, sxm, tpu, jra, thm, bru, kjg, pkt, nev, oog, hal, tnu, mya, kac, kkh, lhu, aem, crw, cje, kjm, mtq, zng, rgs, tyr, twh, tpo, viekhm
Languages: khm, cog, pcb, sxm, tpu, jra, thm, bru, kjg, pkt, nev, oog, hal, tnu, mya, kac, kkh, lhu, aem, crw, cje, kjm, mtq, zng, rgs, tyr, twh, tpo, vie
Tasks: Automatic Speech Recognition
License: Creative Commons Attribution Non Commercial Share Alike 2.0 (cc-by-nc-sa-2.0)
Homepage: https://github.com/CNRS-LACITO/Pangloss_website/
HF URL: -
Paper URL: https://hal.science/hal-00005544

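
For anyone picking this up, here is a minimal sketch of how the subsets listed above could be expanded into per-subset config names. The subset codes are copied verbatim from this issue; the naming pattern (`<dataset>_<subset>_<schema>`) and the two schema suffixes (`source`, `seacrowd_sptext`) are assumptions for illustration and should be checked against the datahub's contribution guide. `config_names` is a hypothetical helper, not part of any existing loader.

```python
DATASET_NAME = "pangloss_collection"

# Subset codes copied verbatim from the "Subsets" field of this issue.
SUBSETS = [
    "khm", "cog", "pcb", "sxm", "tpu", "jra", "thm", "bru", "kjg", "pkt",
    "nev", "oog", "hal", "tnu", "mya", "kac", "kkh", "lhu", "aem", "crw",
    "cje", "kjm", "mtq", "zng", "rgs", "tyr", "twh", "tpo", "viekhm",
]

# ASSUMPTION: one "source" config plus one speech-text config per subset.
# The suffixes are illustrative, not confirmed by this issue.
SCHEMAS = ["source", "seacrowd_sptext"]


def config_names(dataset=DATASET_NAME, subsets=SUBSETS, schemas=SCHEMAS):
    """Return one config name per (subset, schema) pair, in listing order."""
    return [
        f"{dataset}_{subset}_{schema}"
        for subset in subsets
        for schema in schemas
    ]


if __name__ == "__main__":
    names = config_names()
    print(f"{len(names)} configs, e.g. {names[0]}")
```

Generating the names from the list keeps the loader in step with the issue if subsets are added or dropped later.
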
mrqorib commented 7 months ago

self-assign

mrqorib commented 6 months ago

I'm sorry but I won't have enough time to work on this before the deadline. I'm unassigning myself from this issue for now. I can work on it after the deadline if no one picks this up by then and it's still deemed important to be included.