Open SamuelCahyawijaya opened 3 months ago
After checking the data and reading the paper, I'm not sure this data qualifies for any of the Tasks mentioned in the datasheet. The dataset was (probably) created with the help of tools built for those tasks, but it was not designed for those tasks itself. The problems are:
Another reason not to include this in SeaCrowd is that it is a dataset covering 400 languages, not one specific to Southeast Asian languages.
What do you guys think? If you think we should still add this to SeaCrowd, I guess we can treat it as a specific type of word list or word translation dataset and make a custom schema for it. I'm thinking of the following structure for each example:
    {
        'id': 0,
        'wikidata_id': 'Q45',
        'name': 'Portugal',
        'type': 'LOC',
        'name_origin': {
            'en': 'Portugal',
            'it': 'Portogallo',
            ...
            'pl': 'Portugalia',
        }
    }
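To make the proposal a bit more concrete, here is a rough sketch of how that structure could be typed and turned into word translation pairs. The names `ParaNamesExample` and `iter_name_pairs` are hypothetical, not part of any existing SeaCrowd schema, and the record only uses the three languages shown above; treat it as an illustration, not the final schema.

```python
from typing import Dict, Iterator, Tuple, TypedDict


class ParaNamesExample(TypedDict):
    """Hypothetical record type mirroring the structure proposed above."""

    id: int
    wikidata_id: str
    name: str
    type: str  # entity type, e.g. 'LOC'
    name_origin: Dict[str, str]  # language code -> entity name in that language


def iter_name_pairs(
    example: ParaNamesExample, source_lang: str = "en"
) -> Iterator[Tuple[str, str, str]]:
    """Yield (target_lang, source_name, target_name) for every non-source language."""
    names = example["name_origin"]
    source_name = names[source_lang]
    for lang, name in names.items():
        if lang != source_lang:
            yield lang, source_name, name


# Example record restricted to the languages listed in the proposal.
example: ParaNamesExample = {
    "id": 0,
    "wikidata_id": "Q45",
    "name": "Portugal",
    "type": "LOC",
    "name_origin": {
        "en": "Portugal",
        "it": "Portogallo",
        "pl": "Portugalia",
    },
}

pairs = list(iter_name_pairs(example))
for target_lang, source_name, target_name in pairs:
    print(f"en -> {target_lang}: {source_name} -> {target_name}")
```

Keeping all names of one entity in a single record avoids duplicating the Wikidata ID and entity type per language pair; pairs can still be derived on the fly as above if a word translation view is needed.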
Dataloader name:
paranames/paranames.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?paranames