SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for idner-news-2k #534

Closed: SamuelCahyawijaya closed this issue 4 months ago

SamuelCahyawijaya commented 5 months ago

Dataloader name: idner_news_2k/idner_news_2k.py

DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?idner_news_2k

| Dataset | idner_news_2k |
|---|---|
| Description | A dataset of Indonesian news for the named-entity recognition (NER) task. The dataset was originally provided by Syaifudin & Nurwidyantoro (2016) (https://github.com/yusufsyaifudin/Indonesia-ner). We manually re-annotated it with more standardized NER tags and split it into three files: train.txt, dev.txt, and test.txt. Each file has three columns: token, PoS tag, and NER tag, in that order. The format follows the CoNLL convention: one token per line, with sentences separated by one empty line. NER tags use the IOB format. PoS tags were produced with UDPipe (http://ufal.mff.cuni.cz/udpipe), a pipeline for tokenization, tagging, lemmatization, and dependency parsing whose models are trained on UD Treebanks. |
| Subsets | - |
| Languages | ind |
| Tasks | Named Entity Recognition |
| License | MIT (mit) |
| Homepage | https://github.com/khairunnisaor/idner-news-2k |
| HF URL | - |
| Paper URL | https://aclanthology.org/2020.aacl-srw.10/ |
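
For anyone picking up the loader, here is a minimal sketch of parsing the three-column CoNLL-style layout described above. It is not the SEACrowd dataloader itself: the function name `read_conll` and the sample rows are hypothetical, the entity labels in the sample (PER, LOC) are made up for illustration, and the code assumes columns are separated by whitespace (tabs or spaces), which should be checked against the actual files.

```python
from typing import Iterable, List, Tuple

# One sentence is a list of (token, pos_tag, ner_tag) triples.
Sentence = List[Tuple[str, str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group three-column lines into sentences split on empty lines.

    Assumption: columns are whitespace-separated and tokens contain no spaces.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # An empty line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        if len(parts) != 3:
            raise ValueError(f"expected 3 columns, got {len(parts)}: {line!r}")
        current.append((parts[0], parts[1], parts[2]))
    # Flush the last sentence when the input does not end with an empty line.
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    # Made-up sample rows, not taken from the dataset.
    sample = [
        "Budi PROPN B-PER",
        "tiba VERB O",
        "",
        "Jakarta PROPN B-LOC",
    ]
    for sentence in read_conll(sample):
        print(sentence)
```

To read a real split, the same function can be fed an open file handle, for example `read_conll(open("train.txt", encoding="utf-8"))`, assuming the file sits in the working directory.
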
R-Damanhuri commented 5 months ago

self-assign