WWW-2019-Neural Chinese Named Entity Recognition via CNN-LSTM-CRF and Joint Training with Word Segmentation

BrambleXu commented 5 years ago

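One-sentence summary:

For Chinese named entity recognition (CNER), the paper uses a CNN-LSTM-CRF model to learn context, trains it jointly with a word segmentation model, and proposes a data augmentation method based on replacing entities.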

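CNER is hard because entities in Chinese are highly context dependent. Unlike English, Chinese has no spaces to separate words, so determining entity boundaries is very difficult. In addition, training data is insufficient in many CNER domains because annotation is too time-consuming.

Key points of the paper: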

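Uses CNN-LSTM-CRF to capture both local and long-distance contexts for CNER.

Proposes a framework to jointly train the CNER and word segmentation models, in order to improve the CNER model's ability to identify entity boundaries.

Automatically generates pseudo samples from the labeled data to enrich the training data.

Results on two datasets show that the model achieves good CNER performance even when training data is insufficient.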

Resources:

pdf
[code](
[paper-with-code](

Keywords:

dataset:
Named Entity Recognition; Word Segmentation; Neural Network
Chinese word segmentation (CWS) models

Notes:

3.1 CNN-LSTM-CRF Architecture for CNER

Model diagram:

3.3 Pseudo Labeled Data Generation

Because training data is insufficient, generating new samples is necessary.

Motivation: Our method is motivated by the observation that if an entity name in a sentence is replaced by another entity name with the same concept, then the new sentence is usually correct in grammar and semantics.

“李刚在阿里工作” (Li Gang works at Alibaba) becomes “王小超在谷歌工作” (Wang Xiaochao works at Google).

The NER labels change as follows:

“B-PER/I-PER/O/B-ORG/I-ORG/O/O” -> “B-PER/I-PER/I-PER/O/B-ORG/I-ORG/O/O”

The CWS (word segmentation) labels, one per character, change as follows: “B/I/B/B/I/B/I” -> “B/I/I/B/B/I/B/I”

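Concretely, first extract all entity names (EN) from the samples. Then randomly select a labeled sentence and replace each of its entities with an entity of the same concept randomly drawn from EN. The NER labels and CWS labels of the resulting pseudo sentence can both be obtained automatically.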
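The replacement procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the data layout (a sentence as a list of `(word, concept)` pairs), the function names `collect_entities`, `make_pseudo` and `char_labels`, and the simplification that every entity is a single segmented word are all assumptions made here for clarity.

```python
import random
from typing import Dict, List, Optional, Tuple

# Assumed layout: a sentence is a list of (word, concept) pairs,
# where concept is None for words that are not entities.
Sentence = List[Tuple[str, Optional[str]]]


def collect_entities(corpus: List[Sentence]) -> Dict[str, List[str]]:
    """Group all entity names (EN) in the labeled corpus by concept."""
    names: Dict[str, set] = {}
    for sentence in corpus:
        for word, concept in sentence:
            if concept is not None:
                names.setdefault(concept, set()).add(word)
    return {concept: sorted(words) for concept, words in names.items()}


def make_pseudo(sentence: Sentence, entity_names: Dict[str, List[str]],
                rng: random.Random) -> Sentence:
    """Replace each entity with a random entity name of the same concept."""
    pseudo = []
    for word, concept in sentence:
        if concept is not None:
            # Prefer a different name; fall back to the original if none exists.
            candidates = [n for n in entity_names[concept] if n != word] or [word]
            word = rng.choice(candidates)
        pseudo.append((word, concept))
    return pseudo


def char_labels(sentence: Sentence) -> Tuple[List[str], List[str]]:
    """Derive character-level NER (BIO) and CWS (B/I) labels automatically."""
    ner, cws = [], []
    for word, concept in sentence:
        for i in range(len(word)):
            cws.append("B" if i == 0 else "I")
            if concept is None:
                ner.append("O")
            else:
                ner.append(("B-" if i == 0 else "I-") + concept)
    return ner, cws


corpus = [
    [("李刚", "PER"), ("在", None), ("阿里", "ORG"), ("工作", None)],
    [("王小超", "PER"), ("在", None), ("谷歌", "ORG"), ("工作", None)],
]
names = collect_entities(corpus)
# The note picks the sentence at random; the first one is used here
# so that the example stays deterministic.
pseudo = make_pseudo(corpus[0], names, random.Random(0))
ner_tags, cws_tags = char_labels(pseudo)
print("".join(word for word, _ in pseudo))
print("/".join(ner_tags))
print("/".join(cws_tags))
```

Because the labels are recomputed from the replaced words, a name of a different length (two characters to three here) needs no manual relabeling. Real corpora would also need to handle entities that span several segmented words, which this sketch does not.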

4.1 Datasets and Experimental Settings

Datasets: Bakeoff-3, Bakeoff-4

4.2 Performance Evaluation

The number of pseudo samples is the same as the number of real labeled samples.

We conducted experiments on different ratios (i.e., 5%, 25% and 100%) of training data to test the performance of different methods with insufficient and sufficient labeled samples. For those methods which involve pseudo labeled samples, the number of pseudo samples is the same with the real labeled samples. The experimental results on the Bakeoff-3 and Bakeoff-4 datasets are shown in Tables 2 and 3 respectively. We have several findings from the results.