SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for MTOB #572

Open SamuelCahyawijaya opened 3 months ago

SamuelCahyawijaya commented 3 months ago

Dataloader name: mtob/mtob.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?mtob

Dataset: mtob
Description: The Machine Translation from One Book (MTOB) dataset is drawn entirely from Visser (2022), a collection of documentation for the Kalamang language based on 11 months of fieldwork conducted in Mas over the course of four years. It consists of three sets of resources: (1) the body of the grammar book, (2) a bilingual wordlist, and (3) an extremely small corpus of parallel Kalamang-English sentences.
Subsets: Grammar Book, Bilingual Wordlist, Parallel Sentence Corpus
Languages: kgv, eng
Tasks: Machine Translation, Language Modeling, Word lists
License: Creative Commons Attribution 4.0 (cc-by-4.0)
Homepage: https://github.com/lukemelas/mtob
HF URL: -
Paper URL: https://arxiv.org/abs/2309.16575
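
For whoever picks this up, here is a minimal sketch of how the Parallel Sentence Corpus subset might be mapped to flat text-to-text records for the Machine Translation task. It assumes the sentences are already loaded as (Kalamang, English) string pairs. The function name `to_t2t_records` and the record field names are illustrative assumptions, not taken from this issue or from the mtob repository, and the sample strings are placeholders rather than real data.

```python
from typing import Dict, Iterable, Iterator, Tuple


def to_t2t_records(
    pairs: Iterable[Tuple[str, str]],
    src_lang: str = "kgv",  # Kalamang, as listed under Languages
    tgt_lang: str = "eng",  # English, as listed under Languages
) -> Iterator[Dict[str, str]]:
    """Yield one flat record per (source, target) sentence pair.

    The field names below are assumed for illustration only.
    """
    for idx, (src, tgt) in enumerate(pairs):
        yield {
            "id": str(idx),
            "text_1": src,
            "text_2": tgt,
            "text_1_name": src_lang,
            "text_2_name": tgt_lang,
        }


# Placeholder pairs; real sentences would come from the dataset files.
sample_pairs = [
    ("<Kalamang sentence 1>", "<English sentence 1>"),
    ("<Kalamang sentence 2>", "<English sentence 2>"),
]
records = list(to_t2t_records(sample_pairs))
```

The Grammar Book and Bilingual Wordlist subsets would likely need their own record shapes, which this sketch does not cover.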

akhdanfadh commented 3 months ago

self-assign