SamuelCahyawijaya commented 3 months ago

Dataloader name: paranames/paranames.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?paranames

Dataset	paranames
Description	ParaNames is a multilingual parallel name resource consisting of 118 million names spanning 400 languages. The dataset was constructed using Wikidata as its source. Names are provided for 13.6 million entities which are mapped to standardized entity types (PER/LOC/ORG).
Subsets	-
Languages	bug, ace, ilo, zlm, tam, bbc, nia, bjn, nrm, tet, pag, ceb, vie, tha, eng, dtp, kjp, bto, lao, min, tgl, por, ban, san, shn, war, nod, gor, bcl, cbk, mnw, jav, mya, mad, krj, khm, lus, btm, cps, pam, abs, hil, ind
Tasks	Named Entity Recognition, Entity Linking, Word-level Translation, Word lists
License	Creative Commons Attribution 4.0 (cc-by-4.0)
Homepage	https://github.com/bltlab/paranames
HF URL	https://huggingface.co/datasets/imvladikon/paranames
Paper URL	https://aclanthology.org/2022.sigtyp-1.15.pdf

mrqorib commented 3 months ago

self-assign

mrqorib commented 1 month ago

After checking the data and reading the paper, I'm not sure this dataset qualifies for any of the Tasks listed in the datasheet. The dataset was (probably) created with the help of tools for those tasks, but it was not built to serve as data for them. The problems are:

Named entity recognition: This dataset is a list of named entities in different languages. If we formulate it as an NER dataset by pairing each entity with its original sentence from Wikipedia, I'm not sure that all named entities in that sentence would be tagged, because the dataset was generated by taking a list of entities from Wikidata entity records rather than by tagging sentences from Wikipedia. In addition, the source sentences of the named entities would come from Wikipedia articles in different languages, so they may not be aligned.
Entity linking: The dataset does not provide the link or a list of candidate articles, unless we want to treat the whole of Wikipedia as the candidate set.
Word-level translation: The domain is too narrow; it covers only named entities.
Word list: The entries are named entities, so they do not represent the vocabulary of a language.

Another reason not to include this in SeaCrowd is that the dataset covers 400 languages and is not specific to Southeast Asian languages.

What do you guys think? If you think we should still add this to SeaCrowd, I guess we can treat it as a specific type of word list or word translation dataset and make a custom schema for it. I'm thinking of the following structure for each example:

{
  'id': 0,
  'wikidata_id': 'Q45',
  'name': 'Portugal',
  'type': 'LOC',
  'name_origin': {
    'en': 'Portugal',
    'it': 'Portogallo',
    ...
    'pl': 'Portugalia',
  }
}
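
For illustration, here is a rough sketch of how flat per-language rows could be grouped into the structure above. It assumes each input row is a dict with the keys `wikidata_id`, `eng`, `label`, `language` and `type`; those key names and the helper `group_rows` are hypothetical and would need to be checked against the actual data files before being used in a loader.

```python
def group_rows(rows):
    """Group flat per-language rows into one record per Wikidata entity.

    Assumed (hypothetical) row keys: wikidata_id, eng, label, language, type.
    """
    grouped = {}
    for row in rows:
        qid = row["wikidata_id"]
        # Create the entity record the first time this Wikidata ID is seen.
        record = grouped.setdefault(
            qid,
            {
                "wikidata_id": qid,
                "name": row["eng"],
                "type": row["type"],
                "name_origin": {},
            },
        )
        # Store the name in this row's language under its language code.
        record["name_origin"][row["language"]] = row["label"]
    # Assign sequential integer ids in first-seen order.
    return [{"id": i, **record} for i, record in enumerate(grouped.values())]


rows = [
    {"wikidata_id": "Q45", "eng": "Portugal", "label": "Portugal", "language": "en", "type": "LOC"},
    {"wikidata_id": "Q45", "eng": "Portugal", "label": "Portogallo", "language": "it", "type": "LOC"},
    {"wikidata_id": "Q45", "eng": "Portugal", "label": "Portugalia", "language": "pl", "type": "LOC"},
]
records = group_rows(rows)
```

With a custom schema like this, one example corresponds to one entity rather than one (entity, language) pair, so filtering to Southeast Asian languages would just mean dropping keys from `name_origin`.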

SEACrowd / seacrowd-datahub

Create dataset loader for ParaNames #515
