SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for ParaNames #515

Open SamuelCahyawijaya opened 3 months ago

SamuelCahyawijaya commented 3 months ago

Dataloader name: paranames/paranames.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?paranames

Dataset: paranames
Description: ParaNames is a multilingual parallel name resource consisting of 118 million names spanning 400 languages. The dataset was constructed using Wikidata as its source. Names are provided for 13.6 million entities which are mapped to standardized entity types (PER/LOC/ORG).
Subsets: -
Languages: bug, ace, ilo, zlm, tam, bbc, nia, bjn, nrm, tet, pag, ceb, vie, tha, eng, dtp, kjp, bto, lao, min, tgl, por, ban, san, shn, war, nod, gor, bcl, cbk, mnw, jav, mya, mad, krj, khm, lus, btm, cps, pam, abs, hil, ind
Tasks: Named Entity Recognition, Entity Linking, Word-level Translation, Word lists
License: Creative Commons Attribution 4.0 (cc-by-4.0)
Homepage: https://github.com/bltlab/paranames
HF URL: https://huggingface.co/datasets/imvladikon/paranames
Paper URL: https://aclanthology.org/2022.sigtyp-1.15.pdf

mrqorib commented 3 months ago

self-assign

mrqorib commented 1 month ago

After checking the data and reading the paper, I'm not sure this dataset qualifies for any of the Tasks listed in the datasheet. It was (probably) created with the help of tools for those tasks, but it was not built for those tasks. The problems are:

Another reason not to include this in SEACrowd is that the dataset covers 400 languages; it is not specific to Southeast Asian languages.

What do you guys think? If you think we should still add this to SEACrowd, I guess we can treat it as a specific type of word list or word translation dataset and make a custom schema for it. I'm thinking of the following structure for each example:

{
  'id': 0,
  'wikidata_id': 'Q45',
  'name': 'Portugal',
  'type': 'LOC',
  'name_origin': {
    'en': 'Portugal',
    'it': 'Portogallo',
    ...
    'pl': 'Portugalia',
  }
}
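
To make the proposed structure concrete, here is a minimal sketch of how flat per-language rows could be grouped into one record per Wikidata entity. It is only an illustration under stated assumptions: the row keys (`wikidata_id`, `eng`, `label`, `language`, `type`), the `NameRecord` type, and the `group_rows` helper are hypothetical names, not taken from the dataset or from the SEACrowd codebase, and `name` is assumed to hold the English name.

```python
from typing import Dict, Iterable, List, TypedDict


class NameRecord(TypedDict):
    """One example in the proposed structure (one record per Wikidata entity)."""

    id: int
    wikidata_id: str
    name: str
    type: str
    name_origin: Dict[str, str]  # language code -> name in that language


def group_rows(rows: Iterable[Dict[str, str]]) -> List[NameRecord]:
    """Collapse flat per-language rows into one record per entity.

    Assumed (hypothetical) row keys: wikidata_id, eng, label, language, type.
    """
    grouped: Dict[str, NameRecord] = {}
    for row in rows:
        qid = row["wikidata_id"]
        if qid not in grouped:
            # First time this entity is seen: create its record.
            grouped[qid] = NameRecord(
                id=len(grouped),
                wikidata_id=qid,
                name=row["eng"],
                type=row["type"],
                name_origin={},
            )
        # Add this language's name to the entity's mapping.
        grouped[qid]["name_origin"][row["language"]] = row["label"]
    return list(grouped.values())


# Toy rows mirroring the example above; real rows may look different.
rows = [
    {"wikidata_id": "Q45", "eng": "Portugal", "label": "Portugal", "language": "en", "type": "LOC"},
    {"wikidata_id": "Q45", "eng": "Portugal", "label": "Portogallo", "language": "it", "type": "LOC"},
    {"wikidata_id": "Q45", "eng": "Portugal", "label": "Portugalia", "language": "pl", "type": "LOC"},
]
records = group_rows(rows)
print(records[0])
```

Keeping all language variants in a single `name_origin` mapping gives one example per entity; whether this is preferable to one example per (entity, language) pair is still an open question for the custom schema.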