CocoLab-2022 / cnnlp-traditionallingustics-enhancement

MIT License

Identify major datasets #1

Open coco-lab-2022 opened 2 years ago

coco-lab-2022 commented 2 years ago

Please summarize the major datasets used in the literature.

gezi-creator commented 2 years ago

Major datasets for Chinese word segmentation (CWS):

Papers on the datasets:

  1. datasets: AS, PKU, CITYU, MSR. The Second International Chinese Word Segmentation Bakeoff (2005). Thomas Emerson. paper:

  2. datasets: CTB6. The Penn Chinese Treebank: Phrase structure annotation of a large corpus (2005). Xue N, Xia F, Chiou F D, ... paper:

  3. datasets: CITYU, CKIP, CTB, MSRA, NCC, PKU, SXU. The Fourth International Chinese Language Processing Bakeoff: Chinese Word Segmentation, Named Entity Recognition and Chinese POS Tagging (2008). Guangjin Jin, Xiao Chen. paper:

  4. datasets: WTB. Dependency Parsing for Weibo: An Efficient Probabilistic Logic Programming Approach (2014). William Yang Wang, Lingpeng Kong, Kathryn Mazaitis, William W. Cohen. paper:

  5. datasets: ZX. Type-supervised domain adaptation for joint segmentation and POS-tagging (2014). Meishan Zhang, Yue Zhang, Wanxiang Che, Ting Liu. paper:

  6. datasets: UD. CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (2017). Daniel Zeman, Martin Popel, ... paper:

  7. datasets: MCWS. More than Text: Multi-modal Chinese Word Segmentation. ACL (2021). Dong Zhang, Zheng Hu, Shoushan Li, Hanqian Wu, Qiaoming Zhu, Guodong Zhou. method: proposes a new dataset for multi-modal Chinese word segmentation (MCWS). paper: