BrambleXu / knowledge-graph-learning

A curated list of awesome knowledge graph tutorials, projects and communities.

arXiv-2019/8-Simplify the Usage of Lexicon in Chinese NER #280

BrambleXu commented 4 years ago

Summary:

Lattice-LSTM (#279) is computationally inefficient, so this paper aims to incorporate lexicon information into the character representations in a more efficient way.

Resource:

Paper information:

Notes:

The authors analyze the advantages and disadvantages of Lattice-LSTM.

Advantages:

The idea of this paper is to keep the advantages above while discarding the LSTM architecture. Instead, the authors propose a new encoding scheme.

Each character c in a sentence s has four corresponding word sets, labeled with the four "BMES" tags (Begin, Middle, End, Single): the matched lexicon words in which c appears at the beginning, in the middle, at the end, or as a single-character word.

If a set is empty, its only member is NONE.

Consider the sentence s = {c1, · · · , c5} and suppose that {c1, c2}, {c1, c2, c3}, {c2, c3, c4}, and {c2, c3, c4, c5} match the lexicon. Then, for c2, B(c2) = {{c2, c3, c4}, {c2, c3, c4, c5}}, M(c2) = {{c1, c2, c3}}, E(c2) = {{c1, c2}}, and S(c2) = {NONE}.

Taking B(c2) = {{c2, c3, c4}, {c2, c3, c4, c5}} from this example, a concrete instance would be B(“南”) = {南京市, 南京大桥}.
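To make the set construction concrete, here is a minimal Python sketch; the function name, the `max_word_len` cap, and the `<NONE>` placeholder token are my own choices, not taken from the paper:

```python
def bmes_word_sets(sentence, lexicon, max_word_len=5):
    """For each character, collect the matched lexicon words in which it
    appears at the Beginning, in the Middle, at the End, or as a Single word."""
    n = len(sentence)
    sets = [{"B": set(), "M": set(), "E": set(), "S": set()} for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, min(n, i + max_word_len) + 1):
            word = sentence[i:j]
            if word not in lexicon:
                continue
            if len(word) == 1:
                sets[i]["S"].add(word)
            else:
                sets[i]["B"].add(word)         # first character of the word
                sets[j - 1]["E"].add(word)     # last character of the word
                for k in range(i + 1, j - 1):  # interior characters
                    sets[k]["M"].add(word)
    for char_sets in sets:                     # empty sets get the NONE member
        for s in char_sets.values():
            if not s:
                s.add("<NONE>")
    return sets

# With this toy lexicon, B("南") = {"南京", "南京市"}.
sets = bmes_word_sets("南京市长江大桥", {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"})
print(sets[0]["B"])
```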

[image: formal definitions of the four word sets B(c), M(c), E(c), S(c)]

The v^s part is a mapping function that turns a word set into a fixed-dimensional vector. Mean-pooling is introduced here to compute the vector representation of a word set S:

$$v^s(S) = \frac{1}{|S|} \sum_{w \in S} e^w(w)$$

where $e^w$ denotes the word embedding lookup.
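As a quick illustration, mean-pooled v^s takes only a couple of lines; `embed` as a dict of NumPy vectors is my own stand-in for the paper's word-embedding lookup table:

```python
import numpy as np

def mean_pool(word_set, embed, dim=50):
    """v^s(S): average the embeddings of all words in set S.
    All-<NONE> sets fall back to a zero vector."""
    vecs = [embed[w] for w in word_set if w in embed]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy usage with random 50-d embeddings (illustrative only).
embed = {w: np.random.randn(50) for w in ["南京", "南京市"]}
v = mean_pool({"南京", "南京市"}, embed)
```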

But mean-pooling does not perform well. Lattice-LSTM used a dynamical weighting algorithm; to keep the model fast, this paper instead uses the frequency of each word as an indication of its weight. The basic idea behind this is that the more times a character sequence occurs in the data, the more likely it is a word. Note that the frequency of a word is a static value that can be obtained offline, which greatly accelerates the weight calculation (e.g., via a lookup table).

$$v^s(S) = \frac{4}{Z} \sum_{w \in S} z(w)\, e^w(w), \qquad Z = \sum_{w \in B(c) \cup M(c) \cup E(c) \cup S(c)} z(w)$$

where $z(w)$ is the static frequency of $w$.
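A sketch of the frequency-weighted version for one character's four sets, following the 4/Z normalization above; the dict-based `freq`/`embed` lookups and the final concatenation are assumptions on my part (the four pooled vectors are joined into the character representation):

```python
import numpy as np

def weighted_pool(char_sets, embed, freq, dim=50):
    """Frequency-weighted v^s over one character's B/M/E/S word sets.
    z(w) is a static corpus frequency, so Z can be precomputed offline."""
    # Z normalizes over the union of all four word sets of this character.
    Z = sum(freq.get(w, 0) for tag in "BMES" for w in char_sets[tag])
    pooled = []
    for tag in "BMES":
        v = np.zeros(dim)
        for w in char_sets[tag]:
            if w in embed:                     # skips the <NONE> placeholder
                v += freq.get(w, 0) * embed[w]
        pooled.append(4.0 / max(Z, 1) * v)
    # The four pooled vectors are concatenated, shape (4 * dim,).
    return np.concatenate(pooled)
```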

The weights of infrequent words are also specifically raised:

[image: smoothed weighting that raises the weights of infrequent words]

Model Graph:

Result:

Thoughts:

Next Reading: