CocoLab-2022 / cnnlp-traditionallingustics-enhancement

MIT License

Identify major datasets #1

Open coco-lab-2022 opened 2 years ago

coco-lab-2022 commented 2 years ago

Please summarize the major datasets used in the literature.

gezi-creator commented 2 years ago

Major datasets for Chinese word segmentation (CWS):

Papers that introduce the datasets:

  1. datasets: AS, PK, CITYU, MSR
     title: The Second International Chinese Word Segmentation Bakeoff (2005)
     authors: Thomas Emerson
     paper: https://aclanthology.org/I05-3017.pdf

  2. datasets: CTB6
     title: The Penn Chinese Treebank: Phrase Structure Annotation of a Large Corpus (2005)
     authors: Xue N, Xia F, Chiou F D, ...
     paper: https://www.coli.uni-saarland.de/~tania/CMGD/site/papers/the-penn-chinese-treebank-phrase-structure-annotation-of-a-large-corpus.pdf

  3. datasets: CITYU, CKIP, CTB, MSRA, NCC, PKU, SXU
     title: The Fourth International Chinese Language Processing Bakeoff: Chinese Word Segmentation, Named Entity Recognition and Chinese POS Tagging (2008)
     authors: Guangjin Jin, Xiao Chen
     paper: https://aclanthology.org/I08-4010.pdf

  4. datasets: WTB
     title: Dependency Parsing for Weibo: An Efficient Probabilistic Logic Programming Approach (2014)
     authors: William Yang Wang, Lingpeng Kong, Kathryn Mazaitis, William W. Cohen
     paper: https://aclanthology.org/D14-1122.pdf

  5. datasets: ZX
     title: Type-Supervised Domain Adaptation for Joint Segmentation and POS-Tagging (2014)
     authors: Meishan Zhang, Yue Zhang, Wanxiang Che, Ting Liu
     paper: https://aclanthology.org/E14-1062.pdf

  6. datasets: UD
     title: CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (2017)
     authors: Daniel Zeman, Martin Popel, ...
     paper: https://iris.unito.it/retrieve/handle/2318/1652589/371422/K17-3001.pdf

  7. datasets: MCWS
     title: More than Text: Multi-modal Chinese Word Segmentation, ACL (2021)
     authors: Dong Zhang, Zheng Hu, Shoushan Li, Hanqian Wu, Qiaoming Zhu, Guodong Zhou
     method: Proposes a new dataset for multi-modal Chinese word segmentation (MCWS)
     dataset link: https://github.com/MANLP-suda/MCWS
     paper: https://aclanthology.org/2021.acl-short.70.pdf