SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for GNOME #513

Closed SamuelCahyawijaya closed 4 months ago

SamuelCahyawijaya commented 5 months ago

Dataloader name: gnome/gnome.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?gnome

Dataset: gnome
Description: A parallel corpus of GNOME localization files, containing the interface text of the GNU Network Object Model Environment (GNOME) as published by the GNOME translation teams. The text in this dataset is relatively short and technical.
Subsets: -
Languages: eng, vie, mya, ind, tha, tgl, zlm, lao
Tasks: Machine Translation
License: Unknown (unknown)
Homepage: https://opus.nlpl.eu/GNOME/corpus/version/GNOME
HF URL: -
Paper URL: https://aclanthology.org/L12-1246/

akhdanfadh commented 5 months ago

Hey, from what I understand, there is no source language for this dataset. Should I make all possible translation pairs with all languages listed here?

EDIT: Based on discussion #456, I'll implement all possible language pairs.

For parallel MT dataloaders, we agreed upon having a subset for every possible direction with at least 1 SEA language.
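
As a rough sketch of what "every possible direction with at least 1 SEA language" could look like for the languages listed above, the snippet below enumerates ordered (source, target) pairs. It assumes that every listed language except `eng` counts as a SEA language, and the `gnome_<src>_<tgt>` subset naming is hypothetical, used only for illustration rather than taken from the actual dataloader.

```python
from itertools import permutations

# Language codes listed in the dataset card of this issue.
LANGUAGES = ["eng", "vie", "mya", "ind", "tha", "tgl", "zlm", "lao"]

# Assumption: every listed language except English is a SEA language.
SEA_LANGUAGES = set(LANGUAGES) - {"eng"}


def translation_directions(languages, sea_languages):
    """Return ordered (source, target) pairs that include at least one SEA language."""
    return [
        (src, tgt)
        for src, tgt in permutations(languages, 2)
        if src in sea_languages or tgt in sea_languages
    ]


def subset_name(src, tgt, dataset="gnome"):
    # Hypothetical naming scheme, for illustration only.
    return f"{dataset}_{src}_{tgt}"


directions = translation_directions(LANGUAGES, SEA_LANGUAGES)
subset_names = [subset_name(src, tgt) for src, tgt in directions]
```

Since `eng` is the only non-SEA language in the list, the filter does not drop any pair here; it would only matter if a second non-SEA language were added.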

akhdanfadh commented 5 months ago

self-assign